March 1955


BIOMETRICS


Vol. 11 No. 1
JOURNAL OF THE BIOMETRIC SOCIETY


Multiple Range and Multiple F Tests David B. Duncan 


Further Contributions to the Theory of Paired Comparisons 
M. G. Kendall 


Comparative Sensitivity of Pair and Triad Flavor 
Intensity Difference Tests J. W. Hopkins and N. T. Gridgeman 


The Description of Genic Interactions in Continuous 
Variation B. I. Hayman and K. Mather 


Quantitative Studies in Diphtheria Prophylaxis: 
An Attempt to Derive a Mathematical Characterization 
of the Antigenicity of Diphtheria Prophylactic L. B. Holt 


Prediction Equations in Quantitative Genetics Alan Robertson 


Determining the Fruit Count on a Tree by Randomized 
Branch Sampling Raymond J. Jessen 


A Further Note on Missing Data H. W. Norton 




The Biometric Society 




FOUNDED BY THE BIOMETRICS SECTION OF THE AMERICAN STATISTICAL ASSOCIATION


TABLE OF CONTENTS 


Multiple Range and Multiple F Tests . . . David B. Duncan 1
Further Contributions to the Theory of Paired Comparisons . . . M. G. Kendall
Comparative Sensitivity of Pair and Triad Flavor Intensity
   Difference Tests . . . J. W. Hopkins and N. T. Gridgeman 63
The Description of Genic Interactions in Continuous
   Variation . . . B. I. Hayman and K. Mather 69
Quantitative Studies in Diphtheria Prophylaxis:
   An Attempt to Derive a Mathematical Characterization
   of the Antigenicity of Diphtheria Prophylactic . . . L. B. Holt 83
Prediction Equations in Quantitative Genetics . . . Alan Robertson 95
Determining the Fruit Count on a Tree by Randomized
   Branch Sampling . . . Raymond J. Jessen 99
A Further Note on Missing Data . . . H. W. Norton 110
113 


March 1955          Volume 11          Number 1


Material for Biometrics should be addressed to Miss Gertrude Cox, Institute of 
Statistics, Box 5457, Raleigh, North Carolina, except that authors residing in one of
the following organized regions can expedite the handling of their papers by sub- 
mitting them to the Assistant Editor for that region. 


British Region: Dr. D. J. Finney, Dept. of Stat., Univ. of Aberdeen, Aberdeen, Scotland;
Australasian Region: Dr. E. A. Cornish, University of Adelaide, Adelaide,
Australia; French Region: Dr. Georges Teissier, Faculte des Sciences de Paris, 1 rue
V. Cousin, Paris, France.


Material for Queries should go to Professor G. W. Snedecor, Statistical Laboratory, 
Iowa State College, Ames, Iowa.


Articles to be considered for publication should be submitted in triplicate. 


THE BIOMETRIC SOCIETY 


General Officers 


President, W. G. Cochran; Secretary-Treasurer, C. I. Bliss; Council, H. C. Batson, 
L. L. Cavalli-Sforza, Georges Darmois, C. W. Emmens, D. J. Finney, Sir Ronald 
Fisher, J. O. Irwin, Arthur Linder, P. C. Mahalanobis, Donald Mainland, Leopold 
Martin, A. M. Mood, C. R. Rao, Georges Teissier, J. W. Tukey, Frank Yates,
W. J. Youden. 


Regional Officers 


Eastern North American Region: Regional President, D. B. Duncan; Secretary-Treasurer,
A. M. Dutton. British Region: Regional President, R. R. Race; Secretary,
E. C. Fieller; Treasurer, A. R. G. Owen. Western North American Region: Regional
President, W. J. Dixon; Secretary-Treasurer, Elizabeth Vaughan. Australasian Region:
Regional President, E. J. Williams; Secretary, W. B. Hall; Treasurer, Miss
Ditchburne. French Region: Regional President, Georges Darmois; Secretary-Treasurer,
Daniel Schwartz. Belgian Region: Regional President, Paul Spehl;
Secretary, Leopold Martin; Treasurer, Claude Panier. Italian Region: Regional
President, C. Barigozzi; Secretary, L. L. Cavalli-Sforza; Treasurer, R. Scossiroli.


National Secretaries 


Denmark, N. F. Gjeddebaek; The Netherlands, E. van der Laan; India, V. G. Panse; 
Germany, Maria-Pia Geppert; Japan, M. Hatamura; Switzerland, Arthur Linder;
Sweden, H. O. A. Wold; Brazil, Americo Groszmann. 


Editorial Board 
Biometrics 
Editor: Gertrude M. Cox; Associate Editor: J. W. Hopkins; Assistant Editors and
Committee Members: C. I. Bliss, Irwin Bross, E. A. Cornish, W. J. Dixon, Mary
Elveback, Ralph Bradley, D. J. Finney, S. Lee Crump, Leopold Martin, K. R. Nair,
Horace W. Norton, G. W. Snedecor and Georges Teissier. Managing Editor: Sarah
P. Carroll.


The Biometric Society is an international society devoted to the mathematical and statistical 
aspects of biology and welcomes to membership biologists, mathematicians, statisticians and others who 
are interested in its objectives. Through its regional organizations the Society sponsors regional and 
local meetings. National secretaries serve the interest of members in Denmark, The Netherlands, India, 
Germany, Japan, Sweden and Brazil and there are many members "at large". Dues in the Society for
1955 for residents of the Western Hemisphere are as follows: Full membership including subscription to 
Biometrics is $7.00. Members of the Biometrics Section of the American Statistical Association who 
subscribe to the journal through that organization may become members of The Biometric Society on 
the payment of $3.00 annual dues. For members in other parts of the world, full membership including 
subscription to Biometrics is $4.50, except that members who subscribe to the journal through the
American Statistical Association pay annual dues of $1.75. Information concerning the Society can be 
obtained from the Secretary, The Biometric Society, Drawer 1106, New Haven 4, Connecticut, U.S.A. 

Annual subscription rates to non-members are as follows: For American Statistical Association
Members, $4.00; for subscribers, non-members of either American Statistical Association or The Biometric
Society, $7.00. Subscriptions should be sent to the Managing Editor, Biometrics, P. O. Box
5457, Raleigh, North Carolina, U.S.A.


Entered as second-class matter at the Post Office at New Haven, Conn., under 
the Act of March 3, 1879. Additional entry at Richmond, Va. Business Office,
52 Hillhouse Ave., New Haven, Conn. Biometrics is published quarterly—in March, 
June, September and December. 




MULTIPLE RANGE AND MULTIPLE F TESTS* 
DAVID B. DUNCAN**


Virginia Polytechnic Institute 
Blacksburg, Virginia 


1. INTRODUCTION 


The common practice for testing the homogeneity of a set of n 
treatment means in an analysis of variance is to use an F (or z) test. 
This procedure has special desirable properties for testing the homo- 
geneity hypothesis that the n population means concerned are equal. 
An F test alone, however, generally falls short of satisfying all of the 
practical requirements involved. When it rejects the homogeneity 
hypothesis, it gives no decisions as to which of the differences among 
the treatment means may be considered significant and which may not. 

To illustrate, Table I shows results of a barley grain yield experiment
conducted by E. Shulkcum of this Institute at Accomac, Virginia, in
1951. Seven varieties, A, B, ..., G, were replicated six times in a
randomized block design. The F ratio (in section b) for testing the
homogeneity of the varietal means is highly significant. This indicates
that one or more of the differences among the means are significant
but it does not specify which ones.


TABLE I. BARLEY GRAIN YIELDS IN BUSHELS PER ACRE


a) Varietal Means Ranked in Order

     A      F      G      D      C      B      E
    49.6   58.1   61.0   61.5   67.6   71.2   71.3

b) Analysis of Variance

    Source               d.f.     m.s.       F
    Between varieties      6     366.97     4.61**
    Between blocks         5     141.95
    Error                 30      79.64

c) Standard Error of a Varietal Mean

    $s_m = \sqrt{79.64/6} = 3.643 \quad (n_2 = 30)$
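The derived quantities in Table I can be re-checked from the mean squares alone. The short sketch below does only that arithmetic; the variable names are illustrative and not part of the paper, and the error degrees of freedom are computed with the usual randomized-block formula (varieties less one, times blocks less one).

```python
import math

# Values copied from Table I, section b.
ms_varieties = 366.97   # mean square between varieties
ms_error = 79.64        # error mean square
n_varieties = 7
replications = 6        # each varietal mean averages six plots

f_ratio = ms_varieties / ms_error
std_error_mean = math.sqrt(ms_error / replications)
error_df = (n_varieties - 1) * (replications - 1)

print(round(f_ratio, 2), round(std_error_mean, 3), error_df)
```

The three printed values correspond to the F ratio, the standard error of a varietal mean and the error degrees of freedom reported in the table.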


The problem we wish to consider is that of testing these differences 
more specifically. Several test procedures have been proposed for 


*Sponsored by the Office of Ordnance Research, U.S. Army, under Contract DA-36-034-ORD-1477,
Technical Reports Nos. 3 (June 1953), 6 (September 1953) and 9 (July 1954).
**Now at the University of Florida. 



2 BIOMETRICS, MARCH 1955 
answering this problem. The simplest of these is one which is often
termed the least-significant-difference (or L.S.D.) test. This has developed
from a brief discussion of the problem by R. A. Fisher (9, section
24) and is described in detail by several authors, for example, Paterson
(14, pp. 38-42) and Davies (4, section 5.28). In this test, the difference
between any two means is declared significant, at the 5% level, say,
if it exceeds a so-called least significant difference $\sqrt{2}\, t\, s_m$ ($t$ being the
5% level significant value from the $t$ distribution), and provided also
that the F test for the homogeneity of the n means involved is significant.
If the F test is not significant, none of the differences is significant
irrespective of its magnitude relative to the least significant difference.
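As a rough illustration of the L.S.D. rule just described, the sketch below applies it to the barley data of Table I. The value 2.042 is an assumption taken from standard tables (the two-sided 5% point of $t$ with 30 degrees of freedom); it is not quoted in the paper, and the function name is hypothetical.

```python
import math

s_m = 3.643               # standard error of a mean, Table I section c
t_5pct = 2.042            # assumed: tabulated two-sided 5% point of t, 30 d.f.
f_is_significant = True   # Table I section b reports a significant F ratio

# Least significant difference: sqrt(2) * t * s_m
lsd = math.sqrt(2) * t_5pct * s_m

def lsd_significant(mean_a, mean_b):
    """A difference counts only if F is significant and it exceeds the L.S.D."""
    return f_is_significant and abs(mean_a - mean_b) > lsd

print(round(lsd, 2))
print(lsd_significant(71.3, 49.6), lsd_significant(58.1, 49.6))
```

Note that the same yardstick is applied to every pair of means, however many other means lie between them.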

Many other tests have also been proposed for solving this problem, 
including several put forward within the last year or two. Further 
tests are being developed at the present time. Originators of these, 
not to mention all, include D. T. Sawkins (18), D. Newman (12),
D. B. Duncan (5-8), J. W. Tukey (21-23), H. Scheffé (19), M. Keuls (10),
S. N. Roy, R. C. Bose (17), H. O. Hartley (25), and J. Cornfield, M.
Halperin, S. Greenhouse (3). Unfortunately, these tests vary considerably
and it is difficult for the user to decide which one to choose for any
given problem.

One objective of this paper is to consider several of the procedures 
which have been proposed and to illustrate their basic points of differ- 
ence, using a geometric method with simple cases involving only three 
means. A second objective is to present certain simple extensions of 
the concepts of power and significance which are useful in analyzing 
these procedures. The development of the simple case examples and 
the latter general concepts will point the way to a clearer evaluation 
of the relative properties and merits of the procedures in general and 
should help the user in making a choice among the available procedures. 
The final objective is to present a new multiple range test (8) which 
combines the features considered to be the best from the previously 
proposed tests. 

2. THE NEW MULTIPLE RANGE TEST


Before discussing the general problem in more detail, it may be 
helpful to look ahead at an example of the application of one of the 
tests. An example of the proposed new test will be used for this purpose. 
This new multiple range test, as it will be termed, combines the simplicity 
and speed of application of a test proposed by Newman (12) and Keuls 
(10) with most of the power advantages of the multiple comparisons 
test previously proposed by the author (6, 7). For the example, we 
shall consider the application of a 5% level test to the varietal yield
means in Table I. 


MULTIPLE F TESTS 


TABLE II. SIGNIFICANT STUDENTIZED RANGES FOR A 5% LEVEL NEW MULTIPLE RANGE TEST

[Table body not recoverable from this copy. Rows are indexed by the error degrees of freedom $n_2$ and columns by the subset size $p$.]
[TABLE III: significant studentized ranges for a 1% level test. Table body not recoverable from this copy.]

The data necessary to perform the test are: (a) the means as shown
in Table I; (b) the standard error of each mean, $s_m = 3.643$; and (c) the
degrees of freedom on which this standard error is based, $n_2 = 30$.

First, a table (Table II) of special significant studentized ranges
for a 5% level test is entered at the row for $n_2 = 30$ degrees of freedom,
and significant studentized ranges are extracted for samples of sizes
$p = 2, 3, 4, 5, 6$ and $7$. The values obtained in this way are 2.89, 3.04,
3.12, 3.20, 3.25 and 3.29 respectively. (Table III shows the significant
studentized ranges which would be used for a 1% level test.)

The significant studentized ranges are then each multiplied by the
standard error, $s_m = 3.643$, to form what may be called shortest significant
ranges. The shortest significant ranges $R_2, R_3, \cdots, R_7$ are recorded
at the top of a worksheet as shown in Table IV.
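The multiplication step can be written out as a small sketch. The studentized ranges are the six values quoted in the text for $n_2 = 30$; the dictionary names are illustrative only.

```python
s_m = 3.643  # standard error of a varietal mean, Table I

# Significant studentized ranges quoted in the text (5% level, n_2 = 30),
# keyed by the subset size p.
sig_studentized_ranges = {2: 2.89, 3: 3.04, 4: 3.12, 5: 3.20, 6: 3.25, 7: 3.29}

# Shortest significant range R_p = (studentized range for p) * s_m
shortest_sig_ranges = {
    p: round(q * s_m, 2) for p, q in sig_studentized_ranges.items()
}

for p, r_p in shortest_sig_ranges.items():
    print(f"R_{p} = {r_p:.2f}")
```

The rounded products are the entries recorded in section a of the worksheet.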

As a final preparatory step it is convenient to display the means in 
ranked order from left to right, spaced so that the distances between 
them are very roughly proportional to their numerical differences. 
This may be done on the worksheet immediately under the shortest 
significant ranges as in Table IV. The lines underscoring the means 
indicate the results and are added as the test proceeds. 


TABLE IV. WORKSHEET


a) Shortest Significant Ranges

   p:            (2)     (3)     (4)     (5)     (6)     (7)
   R_p:         10.53   11.07   11.37   11.66   11.84   11.99

b) Results

   Varieties:     A      F      G      D      C      B      E
   Means:        49.6   58.1   61.0   61.5   67.6   71.2   71.3
                 -----------
                        -------------------------
                               --------------------------------

Note: Any two means not underscored by the same line are significantly
different.

Any two means underscored by the same line are not significantly different.


We now set out to test the differences in the following order: the
largest minus the smallest, the largest minus the second smallest, up
to the largest minus the second largest; then the second largest minus
the smallest, the second largest minus the second smallest, and so on,
finishing with the second smallest minus the smallest. Thus, in the
case of this example the order for testing is: E - A, E - F, E - G,
E - D, E - C, E - B; B - A, B - F, B - G, B - D, B - C; C - A,
C - F, C - G, C - D; D - A, D - F, D - G; G - A, G - F; and
finally F - A.




With only one exception, given below, each difference is significant
if it exceeds the corresponding shortest significant range; otherwise it is
not significant. Because E - A is the range of seven means, it must
exceed $R_7 = 11.99$, the shortest significant range of seven means, to be
significant; because E - F is the range of six means, it must exceed
$R_6 = 11.84$, the shortest significant range for six means, to be significant;
and so on. Exception: The sole exception to this rule is that no difference
between two means can be declared significant if the two means concerned
are both contained in a subset* of the means which has a non-significant
range.

Because of this exception, as soon as a non-significant difference is 
found between two means, it is convenient to group these two means 
and all of the intervening means together by underscoring them with 
a line, as shown for the means {G, D, C, B, E}, for example, in Table IV.
The remaining differences between all members of a subset underscored 
in this way are not significant according to the exception rule. Thus 
they need not, and should not, be tested against shortest significant 
ranges. 

The details of the test are as follows:

1) E - A = 21.7 > 11.99; thus E - A is significant.

2) E - F = 13.2 > 11.84; thus E - F is significant.

3) E - G = 10.3 < 11.66; thus E - G is not significant, and hence
E - D, E - C, E - B; B - G, B - D, B - C; C - G, C - D; and
D - G are not significant by the exception rule. These results are all
denoted by drawing the line under the subset {G, D, C, B, E}.

4) B - A = 21.6 > 11.84; thus B - A is significant.

5) B - F = 13.1 > 11.66; thus B - F is significant.

6) B - G, B - D, B - C; C - G, C - D; and D - G are not significant
from step 3. No line need be added to show this because of
the line under {G, D, C, B, E} already.

7) C - A = 18.0 > 11.66; thus C - A is significant.

8) C - F = 9.5 < 11.37; thus C - F is not significant; and C - G,
C - D; D - F, D - G; and G - F are not significant by the exception
rule. These results are all denoted by drawing the line under the subset
{F, G, D, C}.

9) D - A = 11.9 > 11.37; thus D - A is significant.

10) D - F is not significant from step 8 and D - G is not significant
from step 3 or 8.

11) G - A = 11.4 > 11.07; thus G - A is significant.

12) G - F is not significant from step 8.

13) F - A = 8.5 < 10.53; thus F - A is not significant. The
result is denoted by drawing the line under {A, F}.


*The term subset will be used to include the complete set where necessary, as is the case here.

Each of the steps can be done almost by inspection and the complete
test takes very little time. All that is necessary for a complete recording
of the result is the array of means with the lines underneath, together
with the brief statement giving their interpretation, as shown in section
b of Table IV.
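The testing order and the exception rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's own algorithmic statement: the function names `multiple_range_groups` and `significantly_different` are hypothetical, and the shortest significant ranges are the rounded worksheet values from Table IV.

```python
def multiple_range_groups(means, shortest_ranges):
    """Return the underscored (non-significant) subsets, in ranked order."""
    ranked = sorted(means.items(), key=lambda item: item[1])
    labels = [name for name, _ in ranked]
    values = [value for _, value in ranked]
    groups = []  # (low, high) index pairs of underscored subsets
    # Largest mean first; for each, compare with the smallest mean upward.
    for high in range(len(values) - 1, 0, -1):
        for low in range(high):
            # Exception rule: both means already lie in an underscored subset.
            if any(g_low <= low and high <= g_high for g_low, g_high in groups):
                break
            p = high - low + 1  # number of means spanned by this range
            if values[high] - values[low] <= shortest_ranges[p]:
                groups.append((low, high))  # not significant: underscore
                break
    return [labels[low:high + 1] for low, high in sorted(groups)]


def significantly_different(a, b, groups):
    """Two means differ significantly if no single line underscores both."""
    return not any(a in group and b in group for group in groups)


means = {"A": 49.6, "B": 71.2, "C": 67.6, "D": 61.5, "E": 71.3, "F": 58.1, "G": 61.0}
shortest_ranges = {2: 10.53, 3: 11.07, 4: 11.37, 5: 11.66, 6: 11.84, 7: 11.99}

groups = multiple_range_groups(means, shortest_ranges)
for group in groups:
    print(group)
```

For the barley data this yields the three underscored subsets of the worksheet, {A, F}, {F, G, D, C} and {G, D, C, B, E}. Breaking out of the inner loop after a non-significant range mirrors the remark that the remaining differences inside an underscored subset need not, and should not, be tested.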

In practice there is a short cut which can be used repeatedly to
good advantage, especially when the number of means is large. Instead
of starting by finding the difference E - A, subtract the shortest
significant range for seven means from the top mean E. This gives
71.3 - 11.99 = 59.31. Since A and F are each less than 59.31, it
follows that E - A and E - F are both significant. This is so because
the shortest significant ranges $R_p$ become smaller with decreases in the
subset size $p$. This takes care of steps 1 and 2 in one operation. The
same idea can be used repeatedly throughout the complete application
and may often eliminate many steps at a time especially in a case with
a large number of means.
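A minimal sketch of the short cut for the top mean E follows; the variable names are illustrative. Every mean falling below the threshold differs from E by more than $R_7$, and hence by more than the smaller range $R_p$ that would actually apply.

```python
# Means in ranked order, from Table I.
means = {"A": 49.6, "F": 58.1, "G": 61.0, "D": 61.5, "C": 67.6, "B": 71.2, "E": 71.3}
r_7 = 11.99  # shortest significant range for seven means, Table IV

threshold = means["E"] - r_7
significantly_below_e = [name for name, value in means.items() if value < threshold]

print(round(threshold, 2), significantly_below_e)
```

Only A and F fall below the threshold, so individual testing against E resumes with E - G, as in step 3.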

The foregoing provides a brief introduction to many of the features
of the problem involved as well as an illustration of the proposed new
multiple range test. We now begin afresh considering matters in more
detail.


3. GENERAL ASSUMPTIONS AND DECISIONS 


In the general problem we are given a sample of observed means,
$m_1, m_2, \cdots, m_n$, which are assumed to have been drawn independently
from $n$ normal populations with "true" means, $\mu_1, \mu_2, \cdots, \mu_n$ respectively,
and a common standard error $\sigma_m$. This standard error is unknown,
but there is available the usual estimate $s_m$, which is independent
of the observed means and is based on a number of degrees of freedom,
denoted by $n_2$. (More precisely, $s_m$ has the property that $n_2 s_m^2/\sigma_m^2$ is
distributed as $\chi^2$ with $n_2$ degrees of freedom, independently of
$m_1, m_2, \cdots, m_n$.)

In the simplest case, with only two means $m_1$ and $m_2$, there are
three possible decisions. These are:

1) $m_1$ is significantly less than $m_2$;

2) $m_1$ and $m_2$ are not significantly different;

3) $m_2$ is significantly less than $m_1$.

It is convenient to denote these decisions by $(1, 2)$, $(\underline{1, 2})$, and $(2, 1)$,
respectively. The order of the numbers in each pair of parentheses
indicates the ranking of the means except when underscored, in which
case the means are not ranked.




In passing it should be noted that we do not intend to restrict
consideration, as some writers have done, for example R. E. Bechhofer
(1), to problems in which the middle decision $(\underline{1, 2})$ is eliminated and
the investigator is obliged to make one of the two positive decisions
$(1, 2)$ or $(2, 1)$. Problems of this type and their extensions to cases
involving more than two means may be regarded as special cases of
the problems treated here in which the significance level is fixed at
100% instead of the usual 5% or 1% level.

In the case $n = 3$, with three means, $m_1$, $m_2$, and $m_3$, there are
19 possible decisions. These comprise:

a) Six decisions of the form: "$m_1$ is significantly less than $m_2$, $m_2$ is
significantly less than $m_3$, and $m_1$ is significantly less than $m_3$." This
joint decision may be conveniently denoted by $(1, 2, 3)$. The remaining
five denoted in the same way are $(1, 3, 2)$, $(2, 1, 3)$, $(2, 3, 1)$, $(3, 1, 2)$,
and $(3, 2, 1)$.

b) Three decisions of the form: "$m_1$ is significantly less than $m_2$
and $m_3$, but $m_2$ and $m_3$ are not significantly different from one another."
This joint decision may be denoted by $(1, \underline{2, 3})$. The remaining two
denoted in the same way are $(2, \underline{1, 3})$ and $(3, \underline{1, 2})$.

c) Three decisions of the form: "$m_1$ and $m_2$ are significantly less
than $m_3$, but $m_1$ and $m_2$ are not significantly different from one another."
This one may be denoted by $(\underline{1, 2}, 3)$ and the remaining two in a similar
way by $(\underline{1, 3}, 2)$ and $(\underline{2, 3}, 1)$.

d) Six decisions of the form: "$m_1$ is significantly less than $m_3$, but
$m_1$ and $m_2$ are not significantly different from one another, and $m_2$ and
$m_3$ are not significantly different from one another." This decision may
be denoted by $(1, 2, 3)$ with the two overlapping pairs 1, 2 and 2, 3 each
underscored, and the remainder in the same way by $(1, 3, 2)$, $(2, 1, 3)$,
$(2, 3, 1)$, $(3, 1, 2)$, and $(3, 2, 1)$.

e) One decision stating: "$m_1$, $m_2$, and $m_3$ are not significantly
different from one another," which may be denoted by $(\underline{1, 2, 3})$.
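The count of 19 can be checked by brute force. The sketch below is an illustration under an explicit modelling assumption that is not spelled out in the paper: a decision is identified with the set of ordered pairs it declares significantly ranked, and an underscoring pattern is any set of lines, each covering two or more adjacent ranked positions, with no line lying inside another. The function names are hypothetical.

```python
from itertools import combinations, permutations


def underscore_patterns(n):
    """All sets of underscoring lines over ranked positions 0..n-1."""
    lines = [(a, b) for a in range(n) for b in range(a + 1, n)]
    patterns = []
    for k in range(len(lines) + 1):
        for combo in combinations(lines, k):
            nested = any(
                x != y and x[0] <= y[0] and y[1] <= x[1]
                for x in combo for y in combo
            )
            if not nested:
                patterns.append(combo)
    return patterns


def decisions(n):
    """Distinct decisions, each stored as its set of significant rankings."""
    found = set()
    for order in permutations(range(1, n + 1)):
        for pattern in underscore_patterns(n):
            ranked_pairs = frozenset(
                (order[i], order[j])            # "m_i significantly less than m_j"
                for i in range(n) for j in range(i + 1, n)
                if not any(a <= i and j <= b for a, b in pattern)
            )
            found.add(ranked_pairs)
    return found


print(len(decisions(2)), len(decisions(3)))
```

Different orderings inside an underscored group give the same set of ranked pairs, which is why duplicates are removed with a set before counting.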
The number of decisions increases very rapidly as $n$ increases.
In the general case with $n$ means there are $n!$ decisions of the form
$(1, 2, \cdots, n)$ with no underscoring, $(n - 1)\,n!/2$ decisions of the form
$(\underline{1, 2}, 3, \cdots, n)$ with one pair of means underscored, $(n - 2)\,n!/3!$
decisions of the form $(\underline{1, 2, 3}, 4, \cdots, n)$ with three means underscored,
$\cdots$, $(n - 2)\,n!$ decisions of the form $(1, 2, 3, 4, \cdots, n)$ with two overlapping
pairs of means underscored, and so on through often large
numbers of many forms, finishing with one decision of the form
$(\underline{1, 2, \cdots, n})$ in which all means are underscored with the one line.
The underscoring has the same interpretation as before; for example,
$(\underline{1, 2, \cdots, n})$ is the decision that the means $m_1, m_2, \cdots, m_n$ are not
significantly different from one another.




The statements of the respective decisions may alternatively be
made in terms of the true means, $\mu_1, \mu_2, \cdots, \mu_n$. The statement,
"$m_i$ is significantly less than $m_j$," is equivalent to the statement,
"$\mu_i$ is less than $\mu_j$." Thus, the decision $(1, 2, 3)$, for example, implies
the acceptance of the hypothesis that $\mu_1 < \mu_2 < \mu_3$. The statement,
"$m_i$ and $m_j$ are not significantly different," is equivalent to the statement
"$\mu_i$ is unranked relative to $\mu_j$," where this is taken to mean that
there is insufficient evidence to tell whether $\mu_i$ is less than, equal to,
or greater than $\mu_j$. Thus the decision $(2, \underline{1, 3})$, for example, consists
of accepting the hypothesis that "$\mu_2 < \mu_1$ and $\mu_2 < \mu_3$, but $\mu_1$ is unranked
relative to $\mu_3$."


4. CONCEPTS OF POWER AND SIGNIFICANCE 


4.1 Power Functions. 


In analysing the power of these tests we are first faced with the 
difficulty that none of them, not even in the simplest case involving 
only two means, is a two-decision procedure, whereas a power function 
as defined by Neyman and Pearson (13) is strictly a two-decision-test 
concept. 

In the three-decision test in the simplest case of two means, one
way of avoiding this difficulty is to group the decisions $(1, 2)$ and $(2, 1)$
together as the decision that $m_1$ and $m_2$ are significantly different, or
in other words as acceptance of the hypothesis $\mu_1 \neq \mu_2$. A convenient
notation for this decision is $(1 \neq 2)$. The given three-decision test is
reduced in this way to a two-decision procedure with decisions $(\underline{1, 2})$
and $(1 \neq 2)$ and as such may be analysed as an $\alpha$-level test of $\mu_1 = \mu_2$
against the two-sided alternative $\mu_1 \neq \mu_2$. The power function obtained
in this way is given by the probability of the decision $(1 \neq 2)$
expressed as a function of the true difference $\epsilon = \mu_1 - \mu_2$. This may
be conveniently denoted by $p(1 \neq 2)$, thus

$$p(1 \neq 2) = P[\text{dec. } (1 \neq 2) \mid \epsilon, \sigma_m^2].$$

An example of $p(1 \neq 2)$ is illustrated by the familiar curve shown by
the dotted line in Figure 1b.

Although $p(1 \neq 2)$ is a most desirable function for measuring the
properties of a test of $\mu_1 = \mu_2$ against $\mu_1 \neq \mu_2$, it has a serious weakness
for measuring the properties of a three-decision test of two means.
By pooling the probabilities of the two decisions $(1, 2)$ and $(2, 1)$ for
any given value of the true difference, it combines the probability
of the correct decision (that $\mu_1$ or $\mu_2$ is the higher mean as the truth
may be), with the probability of the most incorrect decision (that
$\mu_1$ is the higher mean when in fact $\mu_2$ is, or that $\mu_2$ is the higher mean
when in fact $\mu_1$ is). A function which combines probabilities of correct
decisions with probabilities of serious errors in this way is of no value
in measuring desirable or undesirable properties. For this reason
$p(1 \neq 2)$ will not be used as a measure of power in this problem. It
has been discussed only because this function is so familiar that otherwise
readers might have expected to have seen it used.

A more useful analysis of a three-decision test of two means is one
which treats it as the joint application of two two-decision tests, namely,
a test of the hypothesis $\mu_1 \leq \mu_2$ against the alternative $\mu_2 < \mu_1$, and
a test of the hypothesis $\mu_2 \leq \mu_1$ against the alternative $\mu_1 < \mu_2$. This
type of analysis, which is suggested in a more general form by Lehmann
(11, section 11), avoids the difficulties inherent in the $p(1 \neq 2)$
function, and extends readily to cases with more than two means.

From this point of view, a three-decision test has two power functions

$$p(2, 1) = P[\text{dec. } (2, 1) \mid \epsilon, \sigma_m^2]$$

and

$$p(1, 2) = P[\text{dec. } (1, 2) \mid \epsilon, \sigma_m^2],$$

which are the Neyman-Pearson power functions of the tests of $\mu_1 \leq \mu_2$
and $\mu_2 \leq \mu_1$ respectively. Examples of these functions are illustrated
by the sigmoid and the reverse-sigmoid curves respectively in Figure 1b.
Each of these functions has the merit that for any given value of the
true difference $\epsilon$, the function gives the probability of a correct or
incorrect decision, and it is therefore clear whether the function should
be as high or as low as possible. For example, $p(2, 1)$ represents the
probability of deciding that $\mu_1$ is the higher mean. Clearly then, it
will be desirable for $p(2, 1)$ to be as high as possible for $\epsilon = \mu_1 - \mu_2 > 0$,
and to be as low as possible for $\epsilon < 0$.
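The shape of the two power functions can be sketched numerically under a simplifying assumption that is not the paper's setting: $\sigma_m$ is treated as known, so the normal distribution replaces the $t$ distribution. The function name `power_functions` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_functions(epsilon, sigma_m=1.0, alpha=0.05):
    """Return (p(2, 1), p(1, 2)) for true difference epsilon = mu_1 - mu_2.

    Assumes sigma_m is known, so m_1 - m_2 is normal with mean epsilon
    and standard deviation sigma_m * sqrt(2).
    """
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)          # two-sided critical value
    shift = epsilon / (sigma_m * math.sqrt(2))     # standardized difference
    p_21 = 1 - STD_NORMAL.cdf(z - shift)           # decide mu_2 < mu_1
    p_12 = STD_NORMAL.cdf(-z - shift)              # decide mu_1 < mu_2
    return p_21, p_12


for eps in (-3.0, -1.0, 0.0, 1.0, 3.0):
    p_21, p_12 = power_functions(eps)
    print(f"epsilon={eps:+.1f}  p(2,1)={p_21:.4f}  p(1,2)={p_12:.4f}")
```

In this sketch $p(2, 1)$ rises with $\epsilon$ and $p(1, 2)$ falls, and the two are mirror images of one another, which corresponds to the sigmoid and reverse-sigmoid curves described for Figure 1b.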

In the general case of $n$ means we shall use ${}_nP_2$ power functions of
the form

$$p(i, j) = P[\text{dec. } (i, j) \mid \mu_1, \mu_2, \cdots, \mu_n, \sigma_m^2],$$

where decision $(i, j)$ includes all decisions which rank $\mu_i$ lower than $\mu_j$,
and $i, j = 1, 2, \cdots, n$ ($i \neq j$). Each function $p(i, j)$ is the Neyman-Pearson
power function of the test of the hypothesis $\mu_j \leq \mu_i$ against
the alternative $\mu_i < \mu_j$. In general, therefore, $p(i, j)$ measures the
probability of a correct decision with respect to $\mu_i$ and $\mu_j$, over all
values of the true means for which $\mu_i < \mu_j$, and the probability of a
wrong decision over all values of the means for which $\mu_j \leq \mu_i$.

This approach is greatly simplified in all tests we wish to consider
as a result of the reasonable symmetry restriction that all test properties
be invariant under all $n!$ permutations of the true means. In other
words any test we consider must have the same properties for any set
of values of the means irrespective of the identification of (the varieties
represented by) the given means. Under these conditions it is necessary
to investigate only one of the power functions $p(i, j)$ in order to investigate
them all. An example of this is shown by the symmetry of $p(2, 1)$
and $p(1, 2)$ in Figure 1b.


4.2 Significance Levels. 


So far as joint test properties are concerned only a relatively small 
number of significance levels need be considered. These are chosen so 
as to be as few in number as possible and yet have the property that 
once they are fixed at appropriate values, the merits of a test can then 
be judged solely in terms of its individual power functions. 

In the simplest case involving only two means the significance
levels or maximum type 1 error probabilities of the tests of $\mu_1 \leq \mu_2$
and $\mu_2 \leq \mu_1$ considered individually both occur when $\mu_1 = \mu_2$ and, by
symmetry, these levels are equal. Because of this, only one significance
level need be considered for the joint test, and this level may be taken as

$$\alpha = P[\text{dec. } (1 \neq 2) \mid \mu_1 = \mu_2],$$

which is the familiar significance level of the Neyman-Pearson test
of $\mu_1 = \mu_2$ against $\mu_1 \neq \mu_2$. Given that $\alpha$ is fixed at $\alpha_0$, the significance
levels of the individual tests must be $\tfrac{1}{2}\alpha_0$ each.

In further discussion a type 1 error in a test of $\mu_i \leq \mu_j$, namely
the decision $(j, i)$ in cases where $\mu_i \leq \mu_j$, may be usefully termed an
error of wrong ranking or the finding of a wrong significant difference.
The importance of fixing $\alpha$ at $\alpha_0$ may then be said to rest, not so much
on the fact that the probability of a wrong ranking when $\mu_1 - \mu_2 = 0$
has been fixed at $\alpha_0$, but on the fact that the probability of a wrong
ranking at any value of the difference $\mu_1 - \mu_2$ cannot exceed $\alpha_0$.

Any test for the case of three means may be regarded as having
four significance levels of a nature similar to the significance level of a
two-mean test. Three of these are of the form

$$\alpha(1, 2) = \text{maximum } P[\text{dec. } (1 \neq 2) \mid \mu_1 = \mu_2],$$

where the decision $(1 \neq 2)$ includes all decisions which rank $\mu_1$ above
or below $\mu_2$ and the maximization is taken over all possible values of
the true means $\mu_1$, $\mu_2$ and $\mu_3$ for which $\mu_1 = \mu_2$. The level $\alpha(1, 2)$ is,
moreover, the maximum value of the probability of making a wrong
ranking of $\mu_1$ and $\mu_2$ over all possible values of the true means. The
remaining two levels of this same form are

$$\alpha(1, 3) = \text{maximum } P[\text{dec. } (1 \neq 3) \mid \mu_1 = \mu_3],$$
$$\alpha(2, 3) = \text{maximum } P[\text{dec. } (2 \neq 3) \mid \mu_2 = \mu_3],$$

and are the maximum probabilities of making a wrong ranking between
$\mu_1$ and $\mu_3$ and between $\mu_2$ and $\mu_3$ in a similar way.
The fourth significance level involves all three means and is defined as

$$\alpha(1, 2, 3) = P[\text{dec. } (\overline{1, 2, 3}) \mid \mu_1 = \mu_2 = \mu_3],$$

where the decision $(\overline{1, 2, 3})$ includes all decisions which rank at least
one pair of the means relative to one another. In other words, decision
$(\overline{1, 2, 3})$ includes all the 19 decisions previously listed except decision
$(\underline{1, 2, 3})$. This three-mean significance level is simply the probability
of finding at least one wrong significant difference between $m_1$, $m_2$
and $m_3$, that is, of making at least one wrong ranking of any pair of
the true means $\mu_1$, $\mu_2$ and $\mu_3$.

In the case of four means there are eleven significance levels which
may be defined in a similar way. Six of these are two-mean significance
levels of the form

$$\alpha(1, 2) = \text{maximum } P[\text{dec. } (1 \neq 2) \mid \mu_1 = \mu_2],$$

where, as before, the decision $(1 \neq 2)$ includes all decisions ranking
$\mu_1$ and $\mu_2$ relative to one another, and the maximization is taken over
all values of the means $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ for which $\mu_1 = \mu_2$. The remaining
five two-mean significance levels defined in a similar way are
$\alpha(1, 3)$, $\alpha(1, 4)$, $\alpha(2, 3)$, $\alpha(2, 4)$ and $\alpha(3, 4)$.

Four of the levels in this case are three-mean significance levels of
the form

$$\alpha(1, 2, 3) = \text{maximum } P[\text{dec. } (\overline{1, 2, 3}) \mid \mu_1 = \mu_2 = \mu_3],$$

where the decision $(\overline{1, 2, 3})$ includes all decisions which rank at least
one pair of the means $\mu_1$, $\mu_2$ and $\mu_3$ relative to one another, and where
the maximization is taken over all values of the true means for which
$\mu_1 = \mu_2 = \mu_3$. The remaining three three-mean significance levels
similarly defined are $\alpha(1, 2, 4)$, $\alpha(1, 3, 4)$ and $\alpha(2, 3, 4)$.

Finally there is a single four-mean significance level defined as

$$\alpha(1, 2, 3, 4) = P[\text{dec. } (\overline{1, 2, 3, 4}) \mid \mu_1 = \mu_2 = \mu_3 = \mu_4],$$

where decision $(\overline{1, 2, 3, 4})$ represents all decisions which rank at least
one pair of the four means relative to one another. In other words
decision $(\overline{1, 2, 3, 4})$ includes all decisions except decision $(\underline{1, 2, 3, 4})$,
which, following the previous pattern, is the decision that none of the
differences among the four means is significant.




In a general test of $n$ means, there are ${}_nC_2$ two-mean significance
levels, ${}_nC_3$ three-mean significance levels, and so on up to ${}_nC_n = 1$
$n$-mean significance level. A $p$-mean significance level in general
represents the maximum probability of finding at least one wrong
significant difference among $p$ observed means.
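The counts of four levels for three means and eleven for four means follow from summing the binomial coefficients. The sketch below enumerates the index sets; the function name is illustrative, and the closed form $2^n - n - 1$ is a standard identity added here only as a cross-check.

```python
from itertools import combinations
from math import comb


def significance_level_subsets(n):
    """Index sets (a_1, ..., a_p), p = 2..n, each carrying one p-mean level."""
    return [
        subset
        for p in range(2, n + 1)
        for subset in combinations(range(1, n + 1), p)
    ]


for n in (3, 4):
    subsets = significance_level_subsets(n)
    by_size = [comb(n, p) for p in range(2, n + 1)]
    print(n, by_size, len(subsets), 2**n - n - 1)
```

For $n = 4$ the enumeration gives six pairs, four triples and the single complete set, matching the $\alpha(\cdot)$ labels listed in the text.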

On careful consideration it appears that all* errors of wrong ranking 
in a test of n means can be adequately controlled by fixing these sig- 
nificance levels at appropriate values. The problem of finding a good 
test is then reduced to finding a procedure which optimizes the power 
functions $p(i, j)$ given that these significance levels are fixed at the
chosen values. 


4.3 Protection Levels. 


The complement of any p-mean significance level may be termed 
a p-mean protection level, and is the minimum probability of finding 
no wrong significant differences among p observed means. The name 
“protection level” is suitable in that the level measures protection
against finding wrong significant differences. 

Thus, in a two-mean test, there is one protection level 


$$\gamma = P[\text{dec. } (\underline{1, 2}) \mid \mu_1 = \mu_2] = 1 - \alpha.$$


If the significance level is 5%, for example, the protection level is 95%.
In a three-mean test, there are three two-mean protection levels
$\gamma(1, 2)$, $\gamma(1, 3)$ and $\gamma(2, 3)$, where, for example,


$$\gamma(1, 2) = \text{minimum } P[\text{dec. } (\underline{1, 2}) \mid \mu_1 = \mu_2] = 1 - \alpha(1, 2)$$


and decision $(\underline{1, 2})$ includes all decisions for which $\mu_1$ and $\mu_2$ are not
ranked relative to one another. In addition there is one three-mean
protection level


$$\gamma(1, 2, 3) = P[\text{dec. } (\underline{1, 2, 3}) \mid \mu_1 = \mu_2 = \mu_3] = 1 - \alpha(1, 2, 3).$$


In a general test of $n$ means there are ${}_nC_p$ $p$-mean protection levels
of the form


$$\gamma(a_1, a_2, \ldots, a_p) = \text{minimum } P[\text{dec. } (\underline{a_1, a_2, \ldots, a_p}) \mid \mu_{a_1} = \mu_{a_2} = \cdots = \mu_{a_p}],$$


where $p = 2, 3, \ldots, n$, each one being the complement of the corresponding
significance level. The symbols $a_1, a_2, \ldots, a_p$ stand for
the subscripts identifying the particular set of $p$ means concerned.


*See also comments on class 2 protection levels in section 5.4.4.


14 BIOMETRICS, MARCH 1955 


(Thus decision $(\underline{a_1, a_2, \ldots, a_p})$ represents the decision that there
are no significant differences between the observed means $m_{a_1}$,
$m_{a_2}, \ldots, m_{a_p}$.)


In further discussion of the controlling of errors of wrong ranking 
it will be somewhat more convenient to think in terms of fixing the 
protection levels of a test rather than in terms of fixing the significance 
levels.


4.4 Consistent Protection Levels. 


We now consider the important question: In any test of $n$ means,
given that $\gamma_2$ is an appropriate value for the two-mean protection
levels, what values $\gamma_3, \gamma_4, \ldots, \gamma_n$ should be regarded as satisfactory
for the three-mean, four-mean, etc., protection levels, and for the
$n$-mean protection level?

First it should be noted that if a symmetric test with optimum
power functions were constructed subject only to a restriction on the
value $\gamma_2$, the higher order protection levels would almost invariably
be too low to be satisfactory. For example in the case of four means
when $n_2 = \infty$, a test of this type with $\gamma_2 = 95\%$ would be obtained
by applying six 5% level symmetric normal-deviate tests to each of
the six differences between the four means. The four-mean protection
level of this multiple normal-deviate test, as it may be termed, will be
seen later to be only $\gamma_4 = 79.7\%$. That is, the minimum probability
of finding no wrong significant differences between the four means is
only 79.7%. This is too low to be satisfactory. The three-mean protection
levels in the same test have the value $\gamma_3 = 87.8\%$ which is
also too low.

On the other hand, it does not necessarily follow that all of the 
higher order protection levels should be raised to the value $\gamma_2$ of the
two-mean protection level as some writers have implicitly assumed. 
Any increases in the latter levels must necessarily be made at the expense 
of losses in power (that is, of increases in probabilities of type 2 errors), 
and it is most important that the levels be raised no more than is ab- 
solutely necessary. We shall now show that there are good reasons* for 
raising the higher order protection levels only part of the way towards 
the value of the two-mean protection levels.

*See also (5, section 6) and (6, p. 177).

Suppose, for the sake of an example, that a randomized block
experiment were designed for the purpose of testing (a) the difference
between two varieties $V_1$ and $V_2$, (b) the difference between two
fertilizers $F_1$ and $F_2$ and (c) the difference between two insect control
spray methods $S_1$ and $S_2$. If interactions could be assumed to be
zero, as might well be reasonable, a good design would be obtained
by randomizing the four treatment combinations $V_1F_1S_1$, $V_1F_2S_2$,
$V_2F_1S_2$ and $V_2F_2S_1$ within each block, where $V_1F_1S_1$, for example,
denotes the application of fertilizer $F_1$ and spray method $S_1$ in a plot
sown with variety $V_1$. If the observed means of these combinations
are denoted respectively by $m_1$, $m_2$, $m_3$ and $m_4$, the varietal, fertilizer
and spray differences would be measured respectively by the independent
differences:


$$d_1 = (m_1 + m_2) - (m_3 + m_4) = m_1 + m_2 - m_3 - m_4$$
$$d_2 = (m_1 + m_3) - (m_2 + m_4) = m_1 - m_2 + m_3 - m_4$$
$$d_3 = (m_1 + m_4) - (m_2 + m_3) = m_1 - m_2 - m_3 + m_4$$


Now, provided that the number, $r$, of replications and hence the
number of error degrees of freedom, $n_2 = 3r$, were large enough, it
would be possible to make independent tests of the three given differences.
Under these circumstances, if, say, a 5% level test of each
difference were desired, no reasonable objection could be raised to the
joint unmodified application of three 5% level tests. The joint use of
these tests would be just as valid as if the differences were tested in
three independent and separate experiments. In this joint test, it is
clear that if the three null hypotheses in the individual tests were
simultaneously true, which would imply that the true means $\mu_1$, $\mu_2$,
$\mu_3$, and $\mu_4$ of the four combinations were all equal, the probability of
not rejecting this joint hypothesis would be $(.95)^3 = 85.7\%$. Although
this value is lower than 95%, it is clearly an implicitly unobjectionable
result of having chosen a 95% protection level for each of the independent
tests.

Now, the error of wrongly rejecting the hypothesis $\mu_1 = \mu_2 = \mu_3 = \mu_4$
in this type of test is no less serious than the error of rejecting the same
hypothesis in the type of test under consideration, and a four-mean
protection level is the probability of not making an error of this kind.
Hence, it is argued that the objections to the low four-mean protection
level $\gamma_4 = 79.7\%$ of the 5% level multiple normal-deviate test above
would be appropriately remedied if the level were raised to $\gamma_4 = 85.7\%$.

A similar analogy with two independent 5% level tests of two 
independent differences among three means can be invoked for choosing 
an appropriate value for the three-mean protection levels in the same 
test. This leads to the conclusion that the objection to the low value 
$\gamma_3 = 87.8\%$ for these levels would be removed if they were increased
to $(.95)^2 = 90.25\%$.




The same argument readily generalizes to give the result that the
value $\gamma_p = \gamma_2^{p-1}$ for any $p$-mean protection level is appropriate in association
with the value $\gamma_2$ for a two-mean protection level. The exponent
$p - 1$ in these levels is given by the number of independent comparisons
which can be specified, or the degrees of freedom, among the
$p$ means. For this reason the levels $\gamma_p = \gamma_2^{p-1}$ may be termed protection
levels based on degrees of freedom.

Protection levels of this type have been used in constructing the 
multiple comparisons test (6, 7) and the new multiple range test. In 
the example of section 2 giving a 5% level new multiple range test of 
the seven barley variety means, the values of the protection levels are: 
$\gamma_2 = 95\%$, $\gamma_3 = 90.25\%$, $\gamma_4 = 85.7\%$, $\gamma_5 = 81.5\%$, $\gamma_6 = 77.4\%$ and
$\gamma_7 = 73.5\%$. Since $\gamma_2 = 95\%$, we know that the probability of finding
a significant difference between any two means when the corresponding
true means are equal is definitely less than or equal to 5%. The higher
order protection level values are in accord with this property.
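
The values quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch and not part of the original paper; the helper name `protection_levels` is hypothetical and is introduced here only for illustration. It evaluates $(1 - \alpha)^{p-1}$ for $p = 2, \ldots, n$.

```python
def protection_levels(alpha, n):
    """Return {p: gamma_p} with gamma_p = (1 - alpha) ** (p - 1), p = 2..n.

    Hypothetical helper: it only restates the degrees-of-freedom rule
    gamma_p = gamma_2 ** (p - 1) described in the text.
    """
    gamma_2 = 1.0 - alpha
    return {p: gamma_2 ** (p - 1) for p in range(2, n + 1)}


# 5% level test of seven means, as in the barley example.
levels = protection_levels(0.05, 7)
for p, gamma in levels.items():
    print(f"gamma_{p} = {100 * gamma:.2f}%")
```

Calling `protection_levels(0.05, 101)[101]` in the same way gives the 101-mean level of about 0.6% that is discussed in the next paragraph.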

In a similar 5% level test of 101 means, the first seven protection
level values would be the same and the remainder would get progressively
smaller down to $\gamma_{101} = (.95)^{100} = 0.6\%$ for the 101-mean protection
level. Despite the independent tests analogy already given,
the higher order protection levels may appear unduly low unless their
progressively diminishing importance is fully realized. The appropriateness
of these higher order protection levels in general will be
emphasized by a further discussion of the independent tests analogy
with particular reference to the justification of the 101-mean level
$\gamma_{101} = 0.6\%$.

To take a corresponding analogy, suppose that in the course of a 
year’s work, an experimenter has tested 100 separate null hypotheses 
$H_1, H_2, \ldots, H_{100}$ in 100 independent experiments, and that he has
chosen a 5% level test in each case. Should he be alarmed over the
obvious fact that if the 100 null hypotheses were simultaneously true
there has been only a 0.6% chance of not rejecting this joint hypothesis?
Clearly the answer is no, because it would be illogical to alter any 
given individual test for reasons entirely independent of that test. 

In choosing a 5% level of significance in each test the experimenter 
has implicitly expressed the opinion that there is some a priori chance 
that the respective null hypothesis is not true. It can be stated as a 
general rule that the more one can argue against the truth of a null 
hypothesis on a priori grounds the lower, other things being equal, 
should be the protection level of the test, in order not to waste power 
in detecting the truth of the alternative hypothesis. In choosing a 
5% level test which has a 95% protection level the experimenter is 
implicitly prepared to assume that the a priori probability of the null
hypothesis is less than unity and lower than if, for example, he had
chosen a 1% level test which has a 99% protection level. 

Now, if the individual null hypotheses are independent in the sense 
that their a priori probabilities are independent, and if these probabilities 
are each appreciably less than unity as is implied by the choice of 5% 
levels of significance, the joint a priori probability for p such null 
hypotheses will be the product of the individual probabilities and will 
get less and less as p increases. Hence in the interests of not wasting 
power in detecting the truth of alternatives, it can well be appropriate 
to have lower and lower protection levels for each joint null hypothesis 
as p increases. In the case of the joint null hypothesis that all of the 
100 individual null hypotheses are simultaneously true, for example, 
the a priori probability would be so small that it may be wasteful to 
use more than a very low protection level. 

On extending this line of argument to a full average-weighted-risk 
analysis (24) including considerations of error weight functions and 
more complete Bayes (a priori probability) functions, the appropriate- 
ness of the overall joint test can be fully substantiated. In the full 
analysis the result is found to depend not directly on the independence 
of the Bayes functions of the individual tests, but on a closely related 
property, namely, the additivity of the error weight functions of the
individual tests. An interesting more general form of this result, the 
proof and discussion of which will be presented subsequently as a 
separate paper, may be summarized as follows: 


Let $T$ represent the joint test formed by $k$ individual tests
$T_1, T_2, \ldots, T_k$. Suppose that the error weight functions of
the individual tests are additive in the sense that the error
weight or loss for any joint decision $D$ given any joint hypothesis
$H$ in the joint test $T$ is equal to the sum of the error
weights or losses for the decisions $D_1, D_2, \ldots, D_k$ given the
respective hypotheses $H_1, H_2, \ldots, H_k$, where the latter are
individual test decisions and hypotheses forming $D$ and $H$
respectively.

Then it follows that if each individual test $T_i$ is an optimum
procedure from the point of view of minimizing average
weighted risk, the joint test $T$ is also an optimum procedure in
the same sense.


Applying this to our example with 100 independent 5% level tests, 
we can say that since the error losses from one test to the next are 
additive, which is reasonable to assume because of the independent 
nature of the tests, and if each 5% level has been chosen as the best 
level to use for each test considered individually, then all features of
the joint test are optimum including, among many others, the low
0.6% protection level under special consideration. 

A corresponding argument may be developed concerning the higher 
order protection levels in a test of the differences between n means. 
The larger the number of means involved, the less the a priori chance 
that the means will be homogeneous and the less, therefore, the need 
for a high protection level. The 101-mean protection level value of
0.6% in a 5% level multiple range test of 101 means, for example, 
may well be an optimum value for this level because of the remoteness 
of the possibility that all of the 101 true means are equal. 

Owing to added complexities, it has not been possible thus far to 
prove in complete detail that protection levels based on degrees of 
freedom are exactly optimum in these tests also. However, since such 
protection levels are optimum in sets of independent tests, and since 
their functions are so similar in these tests, it is safe to conclude at 
least that they are close to optimum, and far closer than their only 
proposed rivals, namely, levels which are all equal to the two-mean 
protection level. It therefore seems sound practice to use these levels 
until they can be further improved by a more thorough minimum 
average risk analysis. 

Having defined a set of relations among the values of the p-mean 
protection levels of a test, we therefore need to specify only one of these 
values and the remainder are fixed accordingly. From a practical 
point of view it is most pertinent and useful to define the levels in the 
way adopted in the multiple comparisons test (6, 7) and retained in the 
new multiple range test. The example given for the latter test in 
section 2 is a 5% level test in the sense that its two-mean significance
levels are 5% and the protection levels are $\gamma_p = (.95)^{p-1}$, $p = 2, 3, \ldots, 7$.
Likewise in a general test of $n$ means, an $\alpha$-level test denotes a procedure
in which the two-mean significance levels are $\alpha$ and the protection
levels are $\gamma_p = (1 - \alpha)^{p-1}$, $p = 2, 3, \ldots, n$. With the significance
level of a test defined in this way, all that is necessary in choosing a 
level for a test of a given set of n means is to choose the level which 
would be considered appropriate for a test of the difference between 
any two of the means assuming that the remaining means were not present. 
Provided an appropriate value is chosen for this level, the remaining 
levels in the test are automatically fixed at their correspondingly 
appropriate values. 

5. REVIEW OF SEVERAL TESTS

Comparisons will now be made between several test procedures
which have been proposed for the given problem. In most of the detailed
discussion, consideration will be restricted to the following special
simplifying conditions: The degrees of freedom for error will be assumed
to be infinite, i.e., $n_2 = \infty$; the standard error of a mean will be assumed
to be unity, i.e., $\sigma_m = 1$; and the significance level $\alpha$ of each test will
be 5%, i.e., $\alpha = .05$. These will be referred to briefly as the special
conditions $n_2 = \infty$, $\sigma_m = 1$ and $\alpha = .05$. This will provide a simple
and familiar context for bringing out the main points of difference
between the tests as clearly as possible. These main points are essentially
unaltered when the special conditions are removed.


5.1 The Symmetric Three-Decision t Test of Two Means. 


In the case of two means, the best test for choosing between the
three possible decisions is the following familiar rule, which may be
termed an $\alpha$-level symmetric three-decision t test: Make the decision $(1, 2)$
if $m_1 - m_2 < -\sqrt{2}\, t_\alpha s_m$, the decision $(\underline{1, 2})$ if $|m_1 - m_2| < \sqrt{2}\, t_\alpha s_m$,
or the decision $(2, 1)$ if $m_1 - m_2 > \sqrt{2}\, t_\alpha s_m$; where $t_\alpha$ is the two-tail
$\alpha$-level significant value of $t$.

Under the special conditions $n_2 = \infty$, $\sigma_m = 1$, $\alpha = .05$, the test
reduces to a 5% level symmetric three-decision normal-deviate test and
the significant difference $\sqrt{2}\, t_\alpha s_m = \sqrt{2}\, u_\alpha \sigma_m$ is the familiar value
$1.960\sqrt{2} = 2.77$.
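
As an illustration of the rule under the special conditions, the following is a minimal Python sketch; it is not taken from the paper. The function names `critical_difference` and `three_decision` and the returned labels are hypothetical, and the sketch assumes the reading $x = m_1 - m_2$, with decision $(1, 2)$ meaning that $m_1$ is significantly less than $m_2$.

```python
from math import sqrt
from statistics import NormalDist


def critical_difference(alpha=0.05, sigma_m=1.0):
    """Significant difference sqrt(2) * u_alpha * sigma_m (normal-deviate case)."""
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail normal deviate
    return sqrt(2) * u_alpha * sigma_m


def three_decision(m1, m2, alpha=0.05, sigma_m=1.0):
    """Return one of three labels for the observed difference m1 - m2."""
    critical = critical_difference(alpha, sigma_m)
    x = m1 - m2
    if x < -critical:
        return "(1, 2)"            # m1 significantly less than m2
    if x > critical:
        return "(2, 1)"            # m2 significantly less than m1
    return "not significant"       # the underlined decision (1, 2)


print(round(critical_difference(), 2))
print(three_decision(10.0, 14.0), three_decision(10.0, 12.0))
```

The normal deviate is obtained from the standard library rather than typed in as 1.960, so the same sketch can be reused for other values of $\alpha$.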

This test is satisfactory for the case of two means, and it is only 
when we pass on to consider tests involving more than two means that 
the differences arise in proposed test procedures. It is worthwhile, 
however, to consider various special details of an analysis of the three- 
decision normal-deviate test as an introduction to methods of analysing 
the more complex tests. 

(i) Sample Space. A common useful method for representing
this test graphically is shown in Figure 1a. In this figure, the horizontal
straight line provides an example of a one-dimensional sample space
and is used for plotting the observed difference $x = m_1 - m_2$. Any
point on this line representing an observed value of $x$ is called a sample
point. The line is divided into three intervals, $x < -2.77$, $-2.77 < x < 2.77$,
and $2.77 < x$. These represent the respective sets of points
for which the decisions $(1, 2)$, $(\underline{1, 2})$ and $(2, 1)$ are made and are termed
decision regions. It is convenient to denote each region by the same
symbol, $(1, 2)$, $(\underline{1, 2})$ or $(2, 1)$, that is used for the corresponding
decision.

(ii) Parameter Space. The straight line in Figure 1a may also be
used for plotting values of the “true” difference, $\epsilon = \mu_1 - \mu_2$, between
the true means involved. When used in this way, the line provides an
example of a parameter space, as distinct from its function as a sample
space when used for plotting $x$. Any point on the line representing a
given value of $\epsilon$ is called a parameter point.

(iii) Probability Density. In the special case we are considering,
the probability distribution function $f(x; \epsilon)$ of a sample point $x$ (observed
difference) about a given parameter point $\epsilon$ (given true difference)
is given by a normal probability density function with mean $\epsilon$ and variance
2. For example, when $\epsilon = 0$ this function may be represented by
the familiar curve shown in Figure 1a. The curve for any other value
of $\epsilon$ has the same shape and is located with its center over the given $\epsilon$
value.


FIGURE 1a

Regions for a 5%-level symmetric three-decision normal-deviate test ($\sigma_x = \sqrt{2}$). Horizontal axis: $x = m_1 - m_2$, with region boundaries at $-2.77$ and $2.77$.


FIGURE 1b

Power Functions for 5% Level Symmetric Three-Decision Normal-Deviate Test ($\sigma_x = \sqrt{2}$). Horizontal axis: $\epsilon$, from $-8$ to $8$; vertical axis: probability, from 0 to 1.0.

(iv) Power Functions. The power function $p(1, 2)$ representing
the probability of decision $(1, 2)$ for any given value of $\epsilon$ is given by
the area under the probability density curve for the given $\epsilon$, over the
region $(1, 2)$. Likewise the power function $p(2, 1)$ for the same $\epsilon$ value
is given by the area under the same curve and over the region $(2, 1)$.
The functions $p(1, 2)$ and $p(2, 1)$ are represented by the reverse-sigmoid
and the sigmoid curves in Figure 1b.

(v) Significance and Protection Levels. The significance level,
$\alpha = 5\%$, of this test is represented by the sum of the ordinates of the
power curves in Figure 1b at $\epsilon = 0$, each of which is 2.5%. The protection
level is $1 - \alpha = 95\%$. In Figure 1a, the significance level is the sum
of the areas under the dotted curve for $\epsilon = 0$, over the regions $(1, 2)$
and $(2, 1)$. The protection level is the area of the same curve over the
region $(\underline{1, 2})$. Extensions of these familiar ideas will be useful in illustrations
of corresponding features in tests of more than two means.
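
The two power functions and the levels read from them can be reproduced numerically. The following is a minimal Python sketch under the special conditions, not code from the paper; the names `power_12` and `power_21` are hypothetical, and the sketch assumes $x = m_1 - m_2$ is normal with mean $\epsilon$ and variance 2 and uses the rounded critical value 2.77.

```python
from math import sqrt
from statistics import NormalDist

CRITICAL = 2.77      # rounded 5% level significant difference, sigma_m = 1
SIGMA_X = sqrt(2)    # standard deviation of x = m1 - m2


def power_12(eps):
    """P[decision (1, 2)] = P[x < -2.77] when x ~ N(eps, 2)."""
    return NormalDist(mu=eps, sigma=SIGMA_X).cdf(-CRITICAL)


def power_21(eps):
    """P[decision (2, 1)] = P[x > 2.77] when x ~ N(eps, 2)."""
    return 1.0 - NormalDist(mu=eps, sigma=SIGMA_X).cdf(CRITICAL)


# Significance level: sum of the two ordinates at eps = 0.
alpha = power_12(0.0) + power_21(0.0)
protection = 1.0 - alpha
print(f"alpha = {alpha:.4f}, protection = {protection:.4f}")
```

Because 2.77 is a rounded value, the computed significance level is close to, but not exactly, 5%.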

The virtues of the 5% level normal-deviate three-decision test can 
be summarized most usefully as follows: The minimum protection 
against making a wrong ranking of the two means is 95%, and, for all
procedures for which this is true, the power curves of this test are
uniformly maximized over all values of $\epsilon$ for which they measure probabilities
of correct decisions, and are uniformly minimized over all
values of $\epsilon$ for which they measure probabilities of incorrect decisions.
This provides a good example of the general usefulness of the new 
multiple power function analysis which we have adopted for this and 
for the more complex procedures. 


5.2 Tests of Three Means. General Details. 


(i) Sample Space. To represent a test involving three means,
$m_1$, $m_2$, and $m_3$, a two-dimensional sample space or plane is required
in place of the one-dimensional sample space or line used above for a
two-mean test. In this two-dimensional space it is convenient to plot
the difference $x_1 = m_1 - m_2$ on the horizontal axis and the comparison
$x_2 = (m_1 + m_2 - 2m_3)/\sqrt{3}$ on the vertical axis as rectangular Cartesian
coordinates. Figures 2, 2a, 2b and 2c, and all subsequent sample
space illustrations use these particular coordinates. It will be noted
that $x_2$ is distributed independently of $x_1$ and has the same variance,
$\sigma_x^2 = 2\sigma_m^2$. This leads to certain helpful features of symmetry which
will become evident as we proceed.

Any set of values for the three differences $m_1 - m_2$, $m_1 - m_3$,
and $m_2 - m_3$, between the three means, can be represented by a sample
point $(x_1, x_2)$ in this two-dimensional sample space. For example,
the set of differences $m_1 - m_2 = 4$, $m_1 - m_3 = -1$, and $m_2 - m_3 = -5$,
found in the sample of means $m_2 = 10$, $m_1 = 14$, $m_3 = 15$, gives $x_1 = 4$
and $x_2 = -2\sqrt{3}$. These differences would thus be represented by
the point $(4, -2\sqrt{3})$ located 4 units to the right of and $2\sqrt{3}$ units
below the center of the space. The inverse relations by which the differences
can be obtained from a sample point are $m_1 - m_2 = x_1$,
$m_1 - m_3 = (x_1 + \sqrt{3}\,x_2)/2$, and $m_2 - m_3 = (-x_1 + \sqrt{3}\,x_2)/2$. Thus a
point $(-2, 1)$ represents the set of differences $m_1 - m_2 = -2$,
$m_1 - m_3 = -(2 - \sqrt{3})/2$, and $m_2 - m_3 = (2 + \sqrt{3})/2$.


FIGURE 2

Regions of 5% Level Multiple Normal-Deviate Test ($n_2 = \infty$, $\sigma_m = 1$)
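
The change of coordinates and its inverse can be written out directly. The following is a minimal Python sketch, not code from the paper; the function names are hypothetical, and the subscripts follow one reading of the scanned text, namely $x_1 = m_1 - m_2$ and $x_2 = (m_1 + m_2 - 2m_3)/\sqrt{3}$, with the worked example taken as $m_1 = 14$, $m_2 = 10$, $m_3 = 15$.

```python
from math import sqrt


def to_sample_point(m1, m2, m3):
    """Map three observed means to the sample point (x1, x2)."""
    x1 = m1 - m2
    x2 = (m1 + m2 - 2 * m3) / sqrt(3)
    return x1, x2


def to_differences(x1, x2):
    """Inverse relations: recover the three differences from (x1, x2)."""
    return {
        "m1 - m2": x1,
        "m1 - m3": (x1 + sqrt(3) * x2) / 2,
        "m2 - m3": (-x1 + sqrt(3) * x2) / 2,
    }


x1, x2 = to_sample_point(14.0, 10.0, 15.0)
print(x1, x2)                   # x2 equals -2 * sqrt(3)
print(to_differences(x1, x2))   # differences 4, -1 and -5 up to rounding
print(to_differences(-2.0, 1.0))
```

The second call checks the round trip for the worked example, and the third evaluates the point $(-2, 1)$ mentioned in the text.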

(ii) Parameter Space. The plane used as a sample space in these
figures may also be used for plotting values of the “true” comparisons
$\epsilon_1 = \mu_1 - \mu_2$ and $\epsilon_2 = (\mu_1 + \mu_2 - 2\mu_3)/\sqrt{3}$ between the true means
involved. When used in this way it is termed a parameter space, and
values for $\epsilon_1$ and $\epsilon_2$ constitute a parameter point $(\epsilon_1, \epsilon_2)$. In the parameter
space we shall need to make frequent references to the parameter
point $(\epsilon_1, \epsilon_2) = (0, 0)$, the origin, at which all true means are equal,
i.e., at which $\mu_1 = \mu_2 = \mu_3$. Similarly we shall need to refer to the
dotted lines labelled $\mu_1 = \mu_2$, $\mu_1 = \mu_3$, and $\mu_2 = \mu_3$ in Figures 2a, 2b,
and 2c, representing all points for which $\mu_1 = \mu_2$, $\mu_1 = \mu_3$, and $\mu_2 = \mu_3$
respectively. The position of a parameter point on any one of the lines
depends on the magnitude of the third mean relative to the two equal
means represented by the line.


FIGURES 2a, 2b, 2c

(iii) Probability Density. The probability distribution of a sample
point $(x_1, x_2)$ depends only on $(\epsilon_1, \epsilon_2)$ and from the definition of $x_1$
and $x_2$ it is readily seen that the distribution function $f(x_1, x_2; \epsilon_1, \epsilon_2)$
is a bivariate normal one. Each $x_i$ is distributed normally and independently
about $\epsilon_i$ as mean and with a variance of 2. The distribution
for any parameter point $(\epsilon_1, \epsilon_2)$ can be visualized geometrically as a
bell-shaped surface standing on the sample space plane with its center
located over the given parameter point.


5.3 The Multiple t Test. 


To illustrate the way in which a test can be represented in the
sample space, we shall consider a previously mentioned special case of
the procedure obtained by applying an $\alpha$-level symmetric three-decision
t test separately to each of the hypotheses, $\mu_1 = \mu_2$, $\mu_1 = \mu_3$, and $\mu_2 = \mu_3$.
This may be termed an $\alpha$-level multiple t test, and readily generalizes
to the case of $n$ means in which the individual t tests are applied to
all ${}_nC_2$ hypotheses of the form $\mu_i = \mu_j$ which equate the means considered
in all possible pairs.

As has been pointed out, this procedure does not provide a satisfactory
test for our problem, and it is definitely not recommended for
this purpose. We use it here and at other points in the discussion 
because of the excellent introduction it affords to better but more 
complex procedures. 

Under the special conditions $n_2 = \infty$, $\sigma_m = 1$, $\alpha = .05$, the
multiple t test reduces to the 5% level multiple normal-deviate test.
The 19 regions of this test are as shown in Figure 2:

(i) Decision Regions. The regions of the joint test are formed by
the symmetrical intersection of three sets of two-mean test regions as
shown in Figures 2a, 2b, and 2c. In Figure 2a the lines $m_1 - m_2 = -2.77$
and $m_1 - m_2 = 2.77$ divide the sample space into three regions
$(1, 2)$, $(\underline{1, 2})$, and $(2, 1)$. The region $(\underline{1, 2})$ consists of the entire vertical
strip passing down the center of the plane between the lines
$m_1 - m_2 = -2.77$ and $m_1 - m_2 = 2.77$. The regions $(1, 2)$ and $(2, 1)$
are the remainders of the sample space plane lying to the left and
right of $(\underline{1, 2})$, respectively. These are the regions of the test of $\mu_1 = \mu_2$
and are two-dimensional extensions of the corresponding one-dimensional
regions in Figure 1a. The notation has the same meaning as before;
for example, if a point falls in $(1, 2)$ the decision $(1, 2)$ is made, namely
that $m_1$ is significantly less than $m_2$.

Likewise, the lines $m_1 - m_3 = \pm 2.77$ in Figure 2b divide the sample
plane into the three regions $(1, 3)$, $(\underline{1, 3})$, $(3, 1)$ for the test of $\mu_1 = \mu_3$,
and the lines $m_2 - m_3 = \pm 2.77$ in Figure 2c divide the sample plane
into the three regions $(2, 3)$, $(\underline{2, 3})$, $(3, 2)$ for the test of $\mu_2 = \mu_3$. The
sets of regions for each of these tests are identical with those for the
test of $\mu_1 = \mu_2$, except for a rotation about the origin which is 60°
counterclockwise for the first and 60° clockwise for the second.

Each of the 19 product regions for the joint test in Figure 2 corresponds
to one of the 19 decisions previously listed for the case of
three means. For example, in the intersection of $(1, 2)$, $(3, 1)$, and
$(3, 2)$ in the top left-hand corner of the figure, the associated decisions
$(1, 2)$, $(3, 1)$, and $(3, 2)$ constitute the joint decision $(3, 1, 2)$. This, it
will be recalled, is the decision that $m_1$ is significantly less than $m_2$,
$m_3$ is significantly less than $m_1$, and $m_3$ is significantly less than $m_2$.
The region involved may be thus conveniently denoted as the region
$(3, 1, 2)$. Likewise the intersection of the regions $(\underline{1, 2})$, $(\underline{1, 3})$, and
$(\underline{2, 3})$ is the hexagonal region at the center in which the decision $(\underline{1, 2, 3})$
is made. This may accordingly be denoted as the region $(\underline{1, 2, 3})$.
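
The way the three pairwise tests combine into joint decisions can be checked by brute force. The following is a minimal Python sketch under the special conditions, not code from the paper; `pairwise_decisions` is a hypothetical helper, `None` stands for a pair that is not ranked, and the grid of means is an arbitrary choice made only to visit every region.

```python
from itertools import combinations

CRITICAL = 2.77  # rounded 5% level significant difference, sigma_m = 1


def pairwise_decisions(means, critical=CRITICAL):
    """Apply the three-decision test to every pair; means[0] is m1, and so on."""
    decisions = []
    for i, j in combinations(range(len(means)), 2):
        diff = means[i] - means[j]
        if diff < -critical:
            decisions.append((i + 1, j + 1))  # m_i significantly less than m_j
        elif diff > critical:
            decisions.append((j + 1, i + 1))  # m_j significantly less than m_i
        else:
            decisions.append(None)            # pair not ranked
    return tuple(decisions)


# m3 < m1 < m2 with every gap above 2.77: the joint decision (3, 1, 2).
example = pairwise_decisions([3.0, 6.0, 0.0])

# Count the distinct joint decisions met on a grid of sample means.
# m3 is fixed at 0 because only differences between means matter.
grid = [k * 0.5 for k in range(-20, 21)]
joint_decisions = {pairwise_decisions([a, b, 0.0]) for a in grid for b in grid}
print(example, len(joint_decisions))
```

Of the 27 conceivable combinations of three pairwise outcomes, only those that are mutually consistent can occur, which is why the count on the grid is smaller than 27.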

(ii) Power Functions. The power function $p(1, 2)$, to take one
of the six power functions involved, may be visualized as a power
surface $P[\text{dec. } (1, 2) \mid \epsilon_1, \epsilon_2]$ above the parameter space. The ordinate
of the surface at any point $(\epsilon_1, \epsilon_2)$ is given by the integral over the
region $(1, 2)$ of the bell-shaped distribution for that point. Since the
boundary of region $(1, 2)$ is parallel to the $\epsilon_2$ axis it is clear that sections
of the power surface for different values of $\epsilon_2$ are identical. Each section
is depicted by the reverse-sigmoid $p(1, 2)$ curve shown for the two-mean
test in Figure 1b.

The remaining power functions p(1, 3), p(2, 3), p(2, 1), p(3, 1) 
and p(3, 2) may be visualized as power surfaces, identical with the 
surface for p(1, 2), except that the one for p(1, 3) is rotated 60° counter- 
clockwise about the origin, the one for p(2, 3) is rotated a further 60° 
counterclockwise about the origin, and so on. 

(iii) Protection Levels. The two-mean protection level $\gamma(1, 2) =
\text{minimum } P[\text{dec. } (\underline{1, 2}) \mid \mu_1 = \mu_2]$ is the minimum integral over the
strip-region $(\underline{1, 2})$, of any of the normal bivariate distributions centered
on the line $\mu_1 = \mu_2$. Since the boundaries of $(\underline{1, 2})$ are parallel to the
line $\mu_1 = \mu_2$, the minimum is given by the integral for any one parameter
point $(0, \epsilon_2)$, and is 95%. The remaining two-mean protection levels
$\gamma(1, 3)$ and $\gamma(2, 3)$ can be seen to be 95% in the same way.

The only remaining protection level is the three-mean level
$\gamma(1, 2, 3) = P[\text{dec. } (\underline{1, 2, 3}) \mid \mu_1 = \mu_2 = \mu_3]$. This is given by the
integral over the hexagonal region $(\underline{1, 2, 3})$ of the bell-shaped bivariate
normal distribution centered at the origin (0, 0). Since this region is
the locus of all points for samples in which the range is less than 2.77,
it follows that the integral is the probability $P[q_3 < 2.77]$, where $q_p$
is the standardized range of a sample of $p$ independent observations
from a normal population. Tables for these probabilities are given
by Pearson and Hartley (15), and from these a value of 87.8% is found
for this three-mean protection level. According to the principle of
protection levels based on degrees of freedom, the three-mean protection
level should be 90.25%.

In the test of four means the twelve power functions are similar
to those of the simpler cases in that $p(1, 2)$, for example, can be expressed
as a function of $\mu_1 - \mu_2$ alone. In the reduced form $p(1, 2)$ is
identical with the $p(1, 2)$ function of the two-mean test illustrated in
Figure 1b. The six two-mean and four three-mean protection levels
in this test are readily seen to be $P[q_2 < 2.77] = 95\%$ and
$P[q_3 < 2.77] = 87.8\%$ as for the corresponding levels in the three-mean test.
The four-mean protection level is similarly found to be
$P[q_4 < 2.77] = 79.7\%$.
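
The paper reads these probabilities from the Pearson and Hartley tables. As a rough cross-check, the following minimal Python sketch integrates the distribution of the range numerically. It is an assumption-laden illustration and not the author's method: the function name `range_cdf` is hypothetical, and it relies on the standard expression $P[q_p < w] = p \int \phi(x)\,[\Phi(x + w) - \Phi(x)]^{p-1}\,dx$ for the range of $p$ independent standard normal values, evaluated by Simpson's rule.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def range_cdf(p, w, lo=-8.0, hi=8.0, steps=4000):
    """P[range of p independent N(0, 1) values < w], by Simpson's rule.

    Integrates p * phi(x) * (Phi(x + w) - Phi(x)) ** (p - 1) over x,
    where x plays the role of the smallest observation.
    """
    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        f = _STD.pdf(x) * (_STD.cdf(x + w) - _STD.cdf(x)) ** (p - 1)
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * f
    return p * total * h / 3


for p in (2, 3, 4):
    print(p, round(100 * range_cdf(p, 2.77), 1))
```

The same function can be used to confirm that critical ranges such as 3.32 for three means and 3.63 for four means correspond to a probability of about 95%.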

As has been mentioned previously, it is the lowness of the three- 
mean and four-mean protection levels in these tests which invalidates 
them as satisfactory 5% level procedures. On the other hand their 
power functions considered individually have all of the optimum 
properties of those of the two-mean test. Similar properties are possessed
by $\alpha$-level multiple t tests in general.

The general problem of finding a satisfactory test may be regarded
as that of raising the higher order protection level values of an $\alpha$-level
multiple t test to acceptable values, by methods which interfere as
little as possible with its optimum power functions.


5.4 Multiple Range Tests. 
5.4.1 The Newman-Keuls Test. 


A test proposed by Newman* (12) in 1939 and again by Keuls
(10) in 1952 succeeds very simply in raising all of the low protection
levels of the multiple t test. This test is equivalent to a multiple t
test preceded by several preliminary range tests. Since the t tests of
which the multiple t test is composed may be regarded as range tests of
subsets of two means each, the overall procedure is composed entirely
of range tests and may be usefully termed a multiple range test.


*Newman mentions that the principle of this test was initially suggested to him by “Student.”

An $\alpha$-level Newman-Keuls multiple range test is given by the rule:
The difference between any two means in a set of $n$ means is significant
provided the range of each and every subset which contains the given two
means is significant according to an $\alpha$-level range test. Thus in the case of
three means under the special conditions $n_2 = \infty$, $\sigma_m = 1$, $\alpha = .05$,
the difference $m_1 - m_2$, for instance, is significant when the range of
$m_1$, $m_2$, $m_3$ exceeds 3.32 (the 5% level value of the range of three
means) and $m_1 - m_2$ exceeds 2.77. In the case of four means, $m_1 - m_2$
is significant when the range of $m_1$, $m_2$, $m_3$, $m_4$ exceeds 3.63 (the 5%
level value of the range of four means), the ranges of $m_1$, $m_2$, $m_3$ and
$m_1$, $m_2$, $m_4$ each exceed 3.32, and $m_1 - m_2$ exceeds 2.77.
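
The rule can be applied literally by checking every subset that contains the two means in question. The following is a minimal Python sketch under the special conditions, not code from the paper; the function name `nk_significant` and the table `CRITICAL_RANGE` are hypothetical, the table holds only the three critical ranges quoted in the text, and the sketch therefore covers at most four means.

```python
from itertools import combinations

# 5% level critical ranges quoted in the text (n2 = infinity, sigma_m = 1).
CRITICAL_RANGE = {2: 2.77, 3: 3.32, 4: 3.63}


def nk_significant(means, i, j, critical_range=CRITICAL_RANGE):
    """True if means[i] and means[j] differ significantly under the rule.

    Every subset containing both means must have a range that exceeds
    the critical range for a subset of that size.
    """
    others = [k for k in range(len(means)) if k not in (i, j)]
    for extra_count in range(len(others) + 1):
        for extra in combinations(others, extra_count):
            subset = [means[i], means[j]] + [means[k] for k in extra]
            if max(subset) - min(subset) <= critical_range[len(subset)]:
                return False
    return True


# The pair differs by 3.0 > 2.77, but the three-mean range 3.1 does not exceed 3.32.
print(nk_significant([0.0, 3.0, 3.1], 0, 1))
# Here the three-mean range 3.5 exceeds 3.32, so the same pair is significant.
print(nk_significant([0.0, 3.0, 3.5], 0, 1))
```

Enumerating subsets is exponential in the number of means, which is acceptable for a sketch of three or four means; a stepwise procedure over ordered means would be the usual way to organize the same checks.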


FIGURE 3

5% level multiple range tests ($n_2 = \infty$, $\sigma_m = 1$). Two panels: Newman-Keuls test (with constant protection levels); new test (with special protection levels).


The regions of the three-mean test are shown in Figure 3. These 
are the same as those of the corresponding multiple normal-deviate 
test except for the changes caused by the expansion of the region 
(1, 2, 3) from a regular hexagon with radius* 2.77 to a regular hexagon 
with radius 3.32. This raises the three-mean protection level from 
87.8% to 95%. On the other hand, the two-mean protection levels 
remain unaltered at 95%. For example, the level $\gamma(1, 2)$, which is the
minimum integral over the modified strip region (1, 2) of any distribution 


*The radius of a hexagon will be used as short for the radius of the inscribed circle of the hexagon.




centered on the line $\epsilon_1 = \mu_1 - \mu_2 = 0$, is unchanged because the region
(1, 2) is unaltered away from the origin $(\epsilon_1, \epsilon_2) = (0, 0)$. The integrals
are larger than 95% at the origin but drop to 95% as $|\epsilon_2|$ increases.

The six power functions are readily seen to be similar to those of
the corresponding multiple normal-deviate test except for a general
lowering in the area around the origin. For example, $p(1, 2)$, which
is the integral over the region (1, 2) of the distribution centered at
any point $(\epsilon_1, \epsilon_2)$, is reduced by an amount equal to the integral over
the trapezium-shaped region which has been taken from $(1, 2)$ and
added to $(\underline{1, 2})$. This reduction is greatest for a distribution centered
at $(\epsilon_1, \epsilon_2) = (-3.04, 0)$ (the center of the trapezium) and gets less
as the distance from this point increases.

In the test of four means, the three-mean and four-mean protection
levels are raised from 87.8% and 79.7% respectively to 95%, and
corresponding reductions in the power functions accompany these
changes.
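The protection levels quoted in this section can be checked numerically. The sketch below assumes the standard expression for the distribution of the range of $p$ independent unit normal variates, $P(W \le w) = p \int \phi(x)\,[\Phi(x + w) - \Phi(x)]^{p-1}\,dx$, and evaluates it by the trapezoidal rule; the function names and the integration limits are choices made for this illustration, not part of the paper.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def range_cdf(w, p, lo=-8.0, hi=8.0, steps=4000):
    """P(range of p independent N(0, 1) values <= w).

    Trapezoidal rule applied to
    p * integral of phi(x) * (Phi(x + w) - Phi(x))**(p - 1) dx.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        f = phi(x) * (Phi(x + w) - Phi(x)) ** (p - 1)
        total += f if 0 < k < steps else 0.5 * f
    return p * total * h


# (critical range, subset size) pairs discussed in the text
for w, p in [(2.77, 2), (2.77, 3), (3.32, 3), (2.77, 4), (3.63, 4)]:
    print(p, w, round(100.0 * range_cdf(w, p), 1))
```

With a critical range of 2.77 the three-mean and four-mean levels come out near the 87.8% and 79.7% quoted for the multiple normal-deviate test, and the ranges 3.32 and 3.63 bring them back to about 95%.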


5.4.2 The New Multiple Range Test. 


The new multiple range test applied to the barley yield data in 
section 2 is a multiple range test like the Newman-Keuls procedure, 
except that, as has already been emphasized, it employs the special 
protection levels system based on degrees of freedom. A general 
$\alpha$-level multiple range test of this type is given by the rule: The difference
between any two means in a set of $n$ means is significant provided the
range of each and every subset which contains the given means is significant
according to an $\alpha_p$-level range test where $\alpha_p = 1 - \gamma_p$, $\gamma_p = (1 - \alpha)^{p-1}$,
and $p$ is the number of means in the subset concerned.
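As a small worked illustration of the special protection levels, the following Python sketch evaluates $\gamma_p = (1 - \alpha)^{p-1}$ and $\alpha_p = 1 - \gamma_p$ for a 5% level test. The function names are hypothetical and chosen only for this example.

```python
def protection_level(p, alpha):
    """gamma_p = (1 - alpha)**(p - 1) for a subset of p means."""
    return (1.0 - alpha) ** (p - 1)


def subset_significance_level(p, alpha):
    """alpha_p = 1 - gamma_p, the level of the range test for p means."""
    return 1.0 - protection_level(p, alpha)


for p in (2, 3, 4):
    gamma_p = protection_level(p, 0.05)
    alpha_p = subset_significance_level(p, 0.05)
    print(p, round(100.0 * gamma_p, 2), round(100.0 * alpha_p, 2))
```

For $p = 2, 3, 4$ this gives protection levels of 95%, 90.25% and about 85.7%, so the range test for three means is carried out at the 9.75% level.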

Figure 3 shows the regions of this test applied to three means under 
the same special conditions as before. These regions are identical 
with those of the corresponding Newman-Keuls test, also shown in 
Figure 3, except that the center hexagon has a radius of 2.92 instead 
of 3.32 and the adjacent regions are changed accordingly. This is 
sufficient to give the test a three-mean protection level of 90.25%. The 
two-mean protection levels remain unaltered at 95%, the same as in 
the Newman-Keuls test. 

The power functions of this test are similar to those of the Newman- 
Keuls test except that the reductions relative to the multiple normal 
deviate test are uniformly smaller, making the test uniformly more 
powerful. The reductions in p(1, 2), for example, are given as before 
by integrals over the trapezium formed by the intersection of the 
center hexagon (1, 2, 3) with the original (1, 2) region in Figure 2a. 
Since the hexagon is smaller than in the previous test, the trapezium 




is smaller, and the reduction integrals are therefore uniformly decreased. 
The difference in power is greatest at a point near the center (—3.04, 0) 
of the bigger trapezium and diminishes towards zero with increase of 
distance away from this point. 

In the case of four means, this test raises the four-mean protection 
level from 79.7% to 85.7% and the three-mean levels from 87.8% to 
90.25% in a similar way. The two-mean protection levels remain 
unaltered at 95%. Likewise the power functions are uniformly lower 
than those of the corresponding multiple $t$ test but uniformly higher
than those of the corresponding Newman-Keuls test. 

The gains in power in the new multiple range test are quite appre-
ciable, especially for some parameter points, and are entirely due to
use of protection levels based on degrees of freedom. In passing, the
independent tests analogy used in support of these new levels may be
illustrated for purposes of comparison by the regions of the test shown
in Figure 4. These are the regions of two 5% level independent normal
deviate tests of $x_1 = m_1 - m_2$ and $x_2 = (m_1 + m_2 - 2m_3)/\sqrt{3}$ respec-
tively, assuming $n_2 = \infty$ and $\sigma_m = 1$ as before. Tests like these would
be needed, for example, if $m_1$ and $m_2$ were grain yields from two strains
of one barley variety (A) and $m_3$ were the yield of another variety (B).
Attention under these circumstances might well be restricted to testing
the difference $x_1$ between the two strains of variety A and the difference
$x_2$ between the two varieties A versus B.

The case for protection levels based on degrees of freedom may be 
put very briefly in terms of the tests illustrated in Figures 3 and 4, 
as follows: Because of the independence of its two component tests, 
the joint test in Figure 4 is a valid and acceptable joint procedure. 
The square region (1, 2, 3) at the center of this test has the same
function as the hexagonal region at the center of a multiple range test
in that it is the locus of all points which do not lead to the rejection of
the hypothesis $\mu_1 = \mu_2 = \mu_3$ (which implies $(\epsilon_1, \epsilon_2) = (0, 0)$). It is
adequate, therefore, to increase the dimensions of the hexagonal region 
in a multiple range test only so far as is needed to make the integral of 
the distribution at origin (0, 0) over this region equal to the integral 
of the same distribution over the square region in Figure 4. The latter 
integral is 90.25% and the hexagonal region of the new multiple range 
test in Figure 3 has been constructed in this way. 
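The arithmetic behind the independent tests analogy can be checked directly. The sketch below takes the two comparisons $x_1 = m_1 - m_2$ and $x_2 = (m_1 + m_2 - 2m_3)/\sqrt{3}$ with $\sigma_m = 1$, confirms that they are uncorrelated with common variance 2, and multiplies the two 95% levels. The helper names are hypothetical.

```python
import math

# Coefficients of (m1, m2, m3) in the two comparisons.
ROOT3 = math.sqrt(3.0)
X1 = (1.0, -1.0, 0.0)
X2 = (1.0 / ROOT3, 1.0 / ROOT3, -2.0 / ROOT3)


def variance(coeffs, sigma_m=1.0):
    """Variance of a linear comparison of independent means."""
    return sigma_m ** 2 * sum(k * k for k in coeffs)


def covariance(a, b, sigma_m=1.0):
    """Covariance of two linear comparisons of the same independent means."""
    return sigma_m ** 2 * sum(u * v for u, v in zip(a, b))


print(variance(X1), variance(X2))   # both comparisons have variance 2
print(covariance(X1, X2))           # zero, so the two tests are independent
print(1.96 * math.sqrt(2.0))        # half-width of each strip, about 2.77
print(0.95 * 0.95)                  # joint protection level of the square
```

Since the comparisons are independent, the probability that neither test rejects under the hypothesis of equal means is the product $0.95 \times 0.95 = 0.9025$, the 90.25% used for the hexagon of the new multiple range test.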


5.4.3 Tukey’s Test Based on “Allowances.” 


In 1951 Tukey (22) introduced a procedure for estimating confidence 
intervals, or “allowances” as he called them, for the differences $\mu_i - \mu_j$
which we have been considering. He defined a confidence coefficient 




$\beta$ for the joint procedure as the probability that all intervals simultaneously
contain the values of the corresponding true differences.
This method can be used to give, among other things, a significance
test for our general problem. If, in a procedure with confidence coefficient
$\beta$, the confidence interval for $\mu_i - \mu_j$ is denoted by $I_{ij}(\beta)$, this
test may be expressed as the following rule: Make the decision $(i, j)$ if
$I_{ij}(\beta)$ lies to the left of zero, the decision $(\underline{i, j})$ if $I_{ij}(\beta)$ includes zero, or the
decision $(j, i)$ if $I_{ij}(\beta)$ lies to the right of zero. An $\alpha$-level test, by the
originator’s definition, is obtained by putting $\beta = 1 - \alpha$.

FIGURE 4
Regions for 5% Level Joint Normal-Deviate Tests of Two Independent Comparisons ($n_2 = \infty$, $\sigma_x = \sqrt{2}$)


The test given in this way for three means, under the special conditions
$n_2 = \infty$, $\sigma_m = 1$, $\alpha = .05$, is identical with the multiple normal-
deviate test shown in Figure 2 except that the width of each of the
strips (1, 2), (1, 3), (2, 3) is increased from $2 \times 2.77$ to $2 \times 3.32$. The




method of derivation from confidence intervals implicitly imposes the
restriction that the boundaries of (1, 2), (1, 3), and (2, 3) must be parallel
straight lines. The distance between the lines is widened until the
dimensions of the center hexagon (1, 2, 3) are as large as those of the
Newman-Keuls test, thus making the three-mean protection level
$1 - \alpha = 95\%$. At the same time the two-mean protection levels are
increased uniformly from 95% to 98.1%. This test is readily seen to be
more conservative and uniformly less powerful than any of the previous
procedures.
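The rise in the two-mean protection level can be reproduced from the normal distribution alone. Under $\sigma_m = 1$ the difference $m_1 - m_2$ has standard deviation $\sqrt{2}$, so a strip of half-width $w$ has level $2\Phi(w/\sqrt{2}) - 1$ when $\mu_1 = \mu_2$. The sketch below, with hypothetical function names, evaluates this for the two half-widths mentioned.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_mean_protection(half_width, sigma_m=1.0):
    """P[|m1 - m2| < half_width] when mu1 = mu2.

    The standard deviation of m1 - m2 is sigma_m * sqrt(2).
    """
    z = half_width / (sigma_m * math.sqrt(2.0))
    return 2.0 * normal_cdf(z) - 1.0


print(round(100.0 * two_mean_protection(2.77), 1))  # multiple normal-deviate strips
print(round(100.0 * two_mean_protection(3.32), 1))  # strips widened to 2 x 3.32
```

The half-width 2.77 gives approximately 95%, and widening to 3.32 gives approximately 98.1%, in agreement with the figures stated above.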


5.4.4 Tukey’s 1953 Multiple Range Test. 


In 1953 Tukey (23) relaxed the conservatism of the previous test 
somewhat by proposing a multiple range procedure in which the sig- 
nificant ranges are each midway between the ones required by the 
test based on allowances and those required by the Newman-Keuls 
test. In the case of three means, under the same special conditions as 
before, the regions of this test are the same as those of the Newman- 
Keuls procedure except that the widths between the parallel lines are 
increased from 2.77 to $\tfrac{1}{2}(2.77 + 3.32) = 3.04$. The hexagon radius is
3.32 in both tests. 

In suggesting this test, Tukey drew attention to an important 
point which may be illustrated by the following example. Suppose 
that in a 5% level Newman-Keuls test of four means, again assuming
$n_2 = \infty$ and $\sigma_m = 1$, the values of the true means are $\mu_1 = \mu_2 = \mu$
and $\mu_3 = \mu_4 = \mu + \delta$. If the difference $\delta$ between the two groups
of means is so large that the preliminary range tests are practically
certain to be significant, then the probability of jointly deciding that
both $|m_1 - m_2|$ and $|m_3 - m_4|$ are not significant is
$P[|m_1 - m_2| < 2.77] \times P[|m_3 - m_4| < 2.77] = 90.25\%$. This is an example of a
whole set of levels, which we may call class 2 protection levels, which
are not raised to $(1 - \alpha)$ in an $\alpha$-level Newman-Keuls test and are
more akin to levels based on degrees of freedom. Both of Tukey’s
procedures have been designed with the objective of raising these
class 2 protection levels along with the others to at least $(1 - \alpha)$.
The 1953 test is a modification of the test based on allowances which is
uniformly more powerful than the latter but which, Tukey judges,
still meets his given objective.

When protection levels based on degrees of freedom are adopted, 
as in the new multiple range test, the class 2 levels are automatically 
fixed at, or slightly above (when $n_2$ is small), their appropriate values
and need no special attention. 

In the case of the Newman-Keuls procedure it is not clear whether 




either one of the authors was aware of the presence of these lower 
levels and whether he would wish to defend them as this writer does or 
not. 


5.5 Multiple F Tests. 


A series of tests paralleling the above multiple range tests can be 
defined using F tests instead of range tests. These may conveniently 
be termed multiple F tests. Thus, corresponding to the new multiple
range test, an $\alpha$-level multiple F test with protection levels based on degrees
of freedom may be defined by the following rule: Rule 1. The difference
between any two means in a set of $n$ means is significant provided the vari-
ance of each and every subset which contains the given means is significant
according to an $\alpha_p$-level F test where $\alpha_p = 1 - \gamma_p$, $\gamma_p = (1 - \alpha)^{p-1}$, and
$p$ is the number of means in the subset concerned.

FIGURE 5
5% level multiple F tests with special protection levels ($n_2 = \infty$, $\sigma_m = 1$)

In the case of three means under the special conditions $n_2 = \infty$,
$\sigma_m = 1$, $\alpha = .05$, the regions of this test are as shown in Figure 5. These
regions are the same as those of the corresponding multiple normal-
deviate test except that the strip-regions (1, 2), (1, 3), (2, 3) have
their boundaries expanded to those of the circle centered at the origin,
with radius 3.05. This radius 3.05 is calculated as $\sqrt{4F}$, where* $F$ is
the 9.75% significant value of an F ratio with degrees of freedom 2
and $\infty$. If the center region (1, 2, 3) were comprised of the circle
alone, this would raise the three-mean protection level to just 90.25%


*This test requires special F tables or equivalent tables as given in (6), Tables 1 and 2. 




as desired. The six small areas outside the circle but inside (1, 2, 3) 
give the test a slightly higher protection level than 90.25%, which is 
not necessary and makes some modification of Rule 1 desirable. 
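The radius quoted for the circle can be reproduced without special tables in the limiting case considered here. With 2 and $\infty$ degrees of freedom, $2F$ is distributed as chi-square with 2 degrees of freedom, whose upper tail is $e^{-x/2}$, so the upper $\alpha_p$ point of $F$ is $-\ln \alpha_p$. The sketch below uses that identity; it applies only to this limiting case, and the function names are hypothetical.

```python
import math


def f_critical_2_inf(level):
    """Upper `level` point of F with 2 and infinite degrees of freedom.

    In this limiting case P(F > f) = exp(-f), so the point is -ln(level).
    """
    return -math.log(level)


def circle_radius(level):
    """Radius sqrt(4F) of the circular boundary for three means."""
    return math.sqrt(4.0 * f_critical_2_inf(level))


alpha_3 = 1.0 - 0.95 ** 2            # 0.0975, the three-mean level
print(round(circle_radius(alpha_3), 2))
print(round(circle_radius(0.05), 2))
```

The 9.75% level gives a radius of about 3.05, as stated above; for comparison, a constant 5% level would give about 3.46.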

The multiple F test can be generalized to test the significance of
all linear comparisons of the form $c = \sum_{i=1}^{n} k_i m_i$, where $k_1, k_2, \ldots, k_n$
is any set of arbitrary constants such that $\sum_{i=1}^{n} k_i = 0$. (Each
linear function of this form can be regarded as the difference between
weighted means of two subsets of the full set of means.) The general
rule is: Rule 2. Any comparison of the form $c = \sum_{i=1}^{n} k_i m_i$ is significantly
different from zero provided the variance of each and every subset which
contains all of the means involved in $c$ is significant according to an $\alpha_p$-level
F test and provided also that $c$ differs significantly from zero according to
an $\alpha$-level $t$ test where $\alpha_p = 1 - \gamma_p$, $\gamma_p = (1 - \alpha)^{p-1}$, and $p$ is the number
of means in the subset concerned. By “all of the means involved in $c$”
is meant all means which have non-zero coefficients in the linear func-
tion $c = \sum_{i=1}^{n} k_i m_i$.

The regions of this more general test, under the same special con- 
ditions, are also shown in Figure 5. The three intersecting strip regions 
given by Rule 1 are now replaced by an infinity of strips, all of which 
pass symmetrically through the center of the sample space and inter- 
sect each other at all angles. Each strip and the areas to either side 
of it represent the test regions for the comparison measured at right 
angles to the axis of the strip. For example, the strip region between 
the heavy lines in the illustration contains points for samples in which 
the comparison $c = \tfrac{1}{2}m_1 + \tfrac{1}{2}m_3 - m_2$ is not significantly different
from zero. The areas to either side of this region contain points for 
samples in which the comparison is significantly positive or negative. 


5.5.1 The Multiple Comparisons Test.


The multiple comparisons test proposed by the author in 1951 
(6, 7) is a multiple F test which consists of a compromise between 
Rule 1 and Rule 2. As many significant differences as possible are 
found by the Rule 1 test. Rule 2 is then used to test any comparisons 
of interest within subsets of means not already found to contain sig- 
nificant differences by Rule 1. 

Figure 6 shows the regions of this test under the same special con- 
ditions as before. These regions are identical with those of the Rule 1 
test in Figure 5 except for the additional six regions lying outside the 
circle and inside the original hexagon. These represent regions in 
which comparisons involving all three means are found to be significant. 
In the small region at the top of the circle, for example, various weighted
means of $m_1$ and $m_2$ are significantly larger than $m_3$.


FIGURE 6
5% Level Multiple Comparisons Test ($n_2 = \infty$, $\sigma_m = 1$)


5.5.2 The Least-Significant-Difference Test. 


The basic principle of using a preliminary homogeneity of means 
test to raise a low protection level was first proposed by R. A. Fisher (9). 
A test which has arisen out of his discussion is the least-significant- 
difference test already mentioned in the introduction. 

A general $\alpha$-level test of this type is given by the rule: The difference
between any two means in a set of $n$ is significant provided that the difference
is significant according to an $\alpha$-level $t$ test and provided also that the variance
of the whole set is significant according to an $\alpha$-level F test.

In the case of three means, this is identical with an $\alpha$-level Rule 1
multiple F test with constant levels. The regions of the test under
the same special conditions as before are the same as those of the Rule 1
multiple F test with special levels in Figure 5 or the multiple comparisons
test in Figure 6 except that the radius of the circle is increased to
$\sqrt{4F} = 3.46$, $F$ now being the 5% level value of the F ratio with degrees
of freedom 2 and $\infty$.




In the more general case with n means, n > 3, the least significant 
difference test does not use all of the F tests prescribed by a multiple 
F test and fails to fix adequate values for all of the protection levels 
involved. For example in a test of four means, assuming $n_2 = \infty$,
$\sigma_m = 1$, $\alpha = .05$ as before, we find $\gamma_2 = 95\%$, $\gamma_3 = 87.8\%$, and $\gamma_4 = 95\%$.
The value $\gamma_3$ of the three-mean protection levels is as low as that of the
corresponding multiple normal deviate test. In general, the value $\gamma_p$
of any $p$-mean protection level in an $\alpha$-level least significant difference
test is as low as the $\gamma_p$ value in the corresponding $\alpha$-level multiple $t$
test with the one exception that $\gamma_n$ is raised to $1 - \alpha$.

Thus while this test is more conservative than the new multiple 
range test or the multiple comparisons test for the case of three means, 
it is less conservative in cases with more than three means. 


5.5.3 Scheffé’s Test Based on Contrasts. 


A recent procedure proposed by Scheffé (19) may be described as 
the F test analogue of Tukey’s test based on allowances. 

In the case of three means under the same conditions as before, 
the regions of this test are generated by the symmetrical intersection 
of strip regions with straight boundaries like those of the multiple 
normal-deviate test except that (i) the width of the strips is $2 \times 3.46$
instead of $2 \times 2.77$, and (ii) the strips are infinite in number as in the
Rule 2 multiple F test. The intersections of these strips form a circle
of radius 3.46 at the center and this gives the test a three-mean protec-
tion level of 95%. At the same time the strip-region protection levels 
are raised, by the increases in strip-widths, from 95% to 98.6%. 


5.6 Other Decision Procedures. 


As mentioned previously several writers including Bechhofer (1) 
have dealt with a problem which may be regarded as a special case of 
the general one with which we have been concerned, and procedures 
have been proposed which may be regarded as degenerate multiple 
range or multiple F tests. The decision procedures proposed in the 
given reference, for example, are for deciding that the $t$ largest means
in a sample of $n$ means $m_1, m_2, \ldots, m_n$ are all significantly larger
than all of the remaining $n - t$ means. In one procedure the true
means corresponding to the $t$ largest observed means are not ranked
relative to one another; in another procedure they are. In both cases
the true means in the remaining subgroup are left unranked relative
to one another. To take a simple illustration, in a procedure for choos-
ing the largest mean among four, that is, $t = 1$ and $n = 4$, the decisions




in terms of our previous notation are $(\underline{1, 2, 3}, 4)$, $(\underline{1, 2, 4}, 3)$, $(\underline{1, 3, 4}, 2)$
and $(\underline{2, 3, 4}, 1)$, where $(\underline{1, 2, 3}, 4)$, for example, is the decision that $\mu_4$
is larger than each of the remaining means, which are left unranked
relative to one another.

One very restrictive result of eliminating the missing decisions is 
that all of the protection levels of the procedure are forced to zero, or 
in other words all of the significance levels are forced to 100%. For 
example, in a procedure involving only two means, the experimenter is 
forced to make the decision (1, 2) or (2, 1). Thus, if it so happens
that $\mu_1 = \mu_2$ the probability of making a wrong decision is 100%.
The power curves of this test are similar to the p(1, 2) and p(2, 1) curves
illustrated for the 5% level test in Figure 1b except that each curve is
forced to pass through the 50% power value at $\epsilon = \mu_1 - \mu_2 = 0$. The
usefulness of these procedures is therefore restricted to problems in 
which the experimenter feels impelled to choose a best mean from the 
results of the given experiment alone. 

By limiting themselves to procedures with zero protection levels 
at the outset, the authors of these tests have been able to avoid the 
controversial problem of consistent protection levels and to concentrate 
on other problems such as the tabulation of relations between power 
functions and sample sizes (Bechhofer, 1), and the optimum choice of
the size of an experiment based on minimax considerations (Somerville,
20). 


6. CONCLUDING REMARKS 


Most of the foregoing procedures can be classified usefully according 
to three basic characteristics: 


1. Type of significant differences: separating a procedure such
as the Newman-Keuls test having a set of significant differences
which decrease as the test proceeds, from a procedure such as
Tukey’s test based on allowances which has one constant sig-
nificant difference.

2. Type of protection levels: separating a procedure such as
the Newman-Keuls test having constant values (or lower
limits) of $(1 - \alpha)$ for its protection levels*, from a test such
as the new multiple range test having protection levels based
on degrees of freedom.

3. Type of component tests: separating procedures into several 
categories according to whether they employ range tests, 
F tests, or component tests of another type. 


*excluding class 2 protection levels. 


Table V shows the allocation of several procedures in a classification
of this kind.

The most important of these characteristics is the first, separating
tests 1a, with decreasing significant differences, from tests 1b, with
constant significant differences. The nature of the confidence interval
methods from which the 1b tests are derived is such that in an applica-
tion of one of these tests there is only one single significant value against
which all differences or linear comparisons are tested. This makes for
considerable simplicity. However, the single significant value has to
be so high that the power functions are severely reduced.


TABLE V. CLASSIFICATION OF TEST PROCEDURES ACCORDING TO THREE BASIC
CHARACTERISTICS

                                | 1. Type of Significant Differences
2. Type of Protection Levels    | 1a) Decreasing                       | 1b) Constant
                                | 3. Component Tests                   | 3. Component Tests
                                | 3a) Range       | 3b) F              | 3a) Range            | 3b) F
2a) None less than constant     | Newman-Keuls    |                    | Tukey’s Test Based   | Scheffé Test
    values $(1 - \alpha)$       | Test            |                    | on Allowances        |
2b) Protection levels based on  | New Multiple    | Multiple           |                      |
    degrees of freedom,         | Range Test      | Comparisons Test   |                      |
    $\gamma_p = (1 - \alpha)^{p-1}$ |             |                    |                      |


For example, in a 5% level Tukey test based on allowances for a
case with 20 means (again assuming $n_2 = \infty$, $\sigma_m = 1$), the significant
ranges all have the same value 5.01, as shown in Table VI. This value
5.01 is equal to the largest of the significant ranges of the corresponding
1a test, a 5% level Newman-Keuls test, for which the significant ranges,
also shown in Table VI, decrease with subset size from 5.01 down
to 2.77. In the 1a test, a difference between two means which exceeds
only 2.77 can be significant depending on the disposition of the other
means. In the 1b test no difference can be significant without exceeding
5.01.

Comparing these two tests further, consider two true means in
particular, say $\mu_1$ and $\mu_2$, and suppose that $\mu_1$ is smaller than $\mu_2$. Let
$\mu_1$ and $\mu_2$ on one hand be well separated from the remaining true means
$\mu_3, \ldots, \mu_{20}$ on the other. For example, suppose $\tfrac{1}{2}(\mu_1 + \mu_2) = 120$
and $\mu_3 = \mu_4 = \cdots = \mu_{20} = 100$. Under these circumstances, recalling
that $\sigma_m = 1$, the observed means $m_1$ and $m_2$ will be well separated from
the remaining observed means $m_3, m_4, \ldots, m_{20}$. Because of this, the
ranges of all subsets of three or more of the observed means which
include $m_1$ and $m_2$ are practically certain to be significant. Thus in
the 1a test the probability of correctly deciding that $\mu_1$ is less than $\mu_2$
will be virtually the same as if the remaining means were not present,
that is,

$$P_{1a}(1, 2) = P[\text{dec. } (1, 2) \mid \mu_2 - \mu_1] = P[m_1 - m_2 < -2.77 \mid \mu_2 - \mu_1].$$

For the 1b test, however, the corresponding power is given by

$$P_{1b}(1, 2) = P[\text{dec. } (1, 2) \mid \mu_2 - \mu_1] = P[m_1 - m_2 < -5.01 \mid \mu_2 - \mu_1].$$


TABLE VI. COMPARISON OF SIGNIFICANT RANGES FOR 5% LEVEL TESTS OF 20
MEANS

                          Subset Sizes
Test                      |  2   |  3   |  4   |  5   |  6   |  8   |  10  |  14  |  20
Tukey’s Test Based on
  Allowances              | 5.01 | 5.01 | 5.01 | 5.01 | 5.01 | 5.01 | 5.01 | 5.01 | 5.01
Tukey’s 1953 Test         | 3.89 | 4.16 | 4.32 | 4.44 | 4.52 | 4.65 | 4.74 | 4.88 | 5.01
Newman-Keuls Test         | 2.77 | 3.32 | 3.63 | 3.86 | 4.03 | 4.29 | 4.47 | 4.74 | 5.01
New Multiple Range Test   | 2.77 | 2.92 | 3.02 | 3.09 | 3.15 | 3.23 | 3.29 | 3.38 | 3.47



Table VII shows the values of these two functions and their differ- 
ences for various values of u. — yu, . The differences represent the losses 
in power in the 1b test relative to the 1a test and some of these can be
seen to be very large. 
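The two power functions can be evaluated from the normal distribution, since under $\sigma_m = 1$ the difference $m_1 - m_2$ is normal with mean $-(\mu_2 - \mu_1)$ and standard deviation $\sqrt{2}$. The sketch below does this for the critical values 2.77 and 5.01; the function names are hypothetical, and because the critical values are rounded to two decimals the results should be expected to agree with the tabulated ones only approximately.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def power(delta, critical, sigma_m=1.0):
    """P[m1 - m2 < -critical] when mu2 - mu1 = delta.

    m1 - m2 is normal with mean -delta and sd sigma_m * sqrt(2).
    """
    return normal_cdf((delta - critical) / (sigma_m * math.sqrt(2.0)))


for delta in range(0, 9):
    p_1a = power(delta, 2.77)   # decreasing significant differences
    p_1b = power(delta, 5.01)   # one constant significant difference
    print(delta, round(p_1a, 4), round(p_1b, 4), round(p_1a - p_1b, 4))
```

The loss is largest for differences $\mu_2 - \mu_1$ of about 3 to 5 standard errors of a mean, where the 1a test has a high chance of detecting the difference and the 1b test still has a low one.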

At other parameter values in a 20-mean test, with other arrange- 
ments of the true means, the relative losses in power will not be as 
great. However, it is clear that losses will occur at all values of the 
parameters and many will be considerable. For tests involving more 
than 20 means the differences in power will be even greater, increasing 
as the number of means increases. 




TABLE VII. SEVEREST POWER LOSSES OF 1b TEST RELATIVE TO 1a TEST (5% LEVEL
TESTS OF 20 MEANS)


$\mu_2 - \mu_1$ | 1a Test | 1b Test | Loss
0               | .0250   | .0002   | .0248
1               | .1056   | .0023   | .1033
2               | .2946   | .0166   | .2780
3               |         |         | .4858
4               | .8078   | .2389   | .5689
5               | .9429   | .4960   | .4469
6               | .9887   | .7580   | .2307
7               | .9986   | .9207   | .0779
8               | .9999   | .9826   | .0173
$\infty$        | 1.0000  | 1.0000  | 0.0000


Similar decreases in power must occur in all 1b tests using constant 
significant differences. These losses appear unnecessary and tests of 
this type are therefore not recommended. 

A partial concession to this point of view is made by Tukey (23) 
in his 1953 test already mentioned. The significant ranges for this test 
lie midway between those of the corresponding 1a and 1b tests. An
example of these under the conditions already used for the previous 
20-mean test examples is also given in Table VI. A test of this type, 
however, still suffers considerable losses in power probabilities relative 
to the Newman-Keuls procedure and is also considered to be unneces- 
sarily conservative. 
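The statement that the 1953 ranges lie midway between the 1a and 1b ranges can be checked against the values in Table VI. The sketch below copies the Newman-Keuls and 1953 rows and the constant value 5.01 from that table; the variable and function names are hypothetical.

```python
# Rows of Table VI for subset sizes 2, 3, 4, 5, 6, 8, 10, 14, 20.
NEWMAN_KEULS = (2.77, 3.32, 3.63, 3.86, 4.03, 4.29, 4.47, 4.74, 5.01)
TUKEY_1953 = (3.89, 4.16, 4.32, 4.44, 4.52, 4.65, 4.74, 4.88, 5.01)
ALLOWANCE = 5.01  # single significant range of the test based on allowances


def midway(ranges, constant):
    """Average each decreasing range with the constant range."""
    return [0.5 * (r + constant) for r in ranges]


for mid, tabled in zip(midway(NEWMAN_KEULS, ALLOWANCE), TUKEY_1953):
    print(round(mid, 3), tabled)
```

Each midpoint agrees with the tabulated 1953 value to within rounding in the second decimal place.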

The second most important characteristic is the one concerning 
protection levels. This separates tests 2a, using constant values (or 
lower limits) for protection levels, from tests 2b, using the special lower 
limits based on degrees of freedom. 

As has already been mentioned, the power functions of the 2a 
tests are uniformly lower than those of the corresponding 2b tests. 
Some further idea of this may be obtained from Table VI by comparing 
the Newman-Keuls significant ranges, discussed above, with those of 
the corresponding new multiple range test, which have been taken 
from Table II, row $n_2 = \infty$.

Each of these tests requires that a difference between any two means 
must exceed 2.77 before it can be significant and each thus has two- 
mean protection levels of 95%. The significant ranges for subsets of 
more than two means, however, are larger in the 2a test. As a result 
of this, some differences which may not be significant in the 2a test 
may be significant in the 2b test. It can be seen that the amounts by 




which the power functions of the 2b test exceed those of the 2a test 
are greatest around the origin $\mu_1 = \mu_2 = \cdots = \mu_{20}$ and decrease toward
zero in certain directions away from this point. The same holds for 
any 2b test, relative to the corresponding 2a test. 

There appears to be no sound reason for not using protection levels 
based on degrees of freedom thereby gaining considerably in power to 
detect real differences. 

Finally, there is the subdivision of the test procedures according to 
the type of component tests employed. In this paper we have considered 
only procedures based on range tests (3a) and F tests (3b). However, 
other types of component tests, for example, extreme deviate tests 
and gap tests, have been proposed and one procedure given by Tukey 
(21) is based on a combination of three types of component tests. 

The problem of deciding the relative merits of various types of 
component tests is complex, and much work needs to be done in this 
direction. At present, it appears that the best choice lies between 
range tests and F tests. The relative merits of these depend on the 
objectives involved. 

Under some circumstances (i), interest may lie in testing linear 
comparisons involving several means as well as differences between 
single means; under others (ii), interest may be restricted to testing 
only differences between single means. 

Under circumstances (i) additional power functions are needed to 
measure the power of the test with respect to the additional comparisons 
involved. When these are all included it seems safe to assume that 
multiple F tests are more powerful in some average sense than multiple 
range tests. Under circumstances (ii), however, the relations are more 
obscure. The preliminary tests in a multiple F test with decreasing 
significant differences (1a tests) may cause a little less general inter-
ference* with subsequent tests than do the preliminary range tests in 
a corresponding multiple range test. In this event, the multiple F 
tests may still be more powerful in an average sense but only slightly so. 

The important deciding factor under circumstances (ii) will often 
be the difference in time and effort required in applying the two types 
of tests. The application of a multiple range test is much easier and 
a test of this type will generally be preferred for this reason. 

To summarize, the features recommended in each classification are: 


1. Decreasing significant differences, as used in tests 1a;


*This does not apply of course in 1b tests with constant significant differences, in which case the 
use of range tests gives more powerful procedures. Thus, for example, under circumstances (ii), 
Tukey’s test based on allowances is uniformly more powerful than Scheffé’s test.




2. Protection levels based on degrees of freedom, as used in tests 2b; 
and 

3. Range tests as used in tests 3a, unless one is interested in linear
comparisons other than differences between single means, in
which case F tests are recommended, as used in tests 3b.


The new multiple range test and the multiple comparisons test have 
been designed to include these recommended features. 


Computation of Tables II and III for New Multiple Range Test. 


Let $Q(p, n_2, \alpha)$ represent the entry for given values of $p$, $n_2$, and
$\alpha$ given in Tables II and III for $\alpha = .05$ and .01, respectively. Put
$R(p, n_2, \gamma_{p,\alpha})$ for the $100\gamma_{p,\alpha}$ percentage point of the studentized
range where $\gamma_{p,\alpha} = (1 - \alpha)^{p-1}$. Then the tabled values have been
computed from the relation $Q(p, n_2, \alpha) = R(p, n_2, \gamma_{p,\alpha})$ for $p = 2$,
and from $Q(p, n_2, \alpha) = R(p, n_2, \gamma_{p,\alpha})$ or $Q(p - 1, n_2, \alpha)$, whichever
is the larger, for all other values of $p$. This ensures that each $p$-mean
protection level in the new multiple range test is $\gamma_{p,\alpha}$ for all values of $p$.
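The recursion amounts to taking a running maximum over the percentage points as $p$ increases. A minimal sketch follows; the input values are invented purely to show a case in which the maximum matters and are not entries from Tables II or III, and the function name is hypothetical.

```python
from itertools import accumulate


def tabled_ranges(percentage_points):
    """Q(p) = max(R(p), Q(p - 1)), with Q(2) = R(2).

    `percentage_points` lists R(p) for p = 2, 3, ... at fixed n2 and alpha.
    """
    return list(accumulate(percentage_points, max))


# Hypothetical R(p) values for p = 2, 3, 4, 5 (illustration only).
raw = [3.00, 3.10, 3.08, 3.12]
print(tabled_ranges(raw))
```

In this invented example the value for $p = 4$ falls below that for $p = 3$, so the tabled entry for $p = 4$ is carried forward from $p = 3$.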

The studentized range values $R(p, n_2, \gamma_{p,\alpha})$ for $2 \le p \le 20$ and
$10 \le n_2 \le \infty$ used in this process have been obtained from Pearson
and Hartley’s Tables (16). The remainder of the $R(p, n_2, \gamma_{p,\alpha})$ values
involved have been obtained by new methods (see Beyer, 2) specially
developed for this purpose. 

Acknowledgment. The author is indebted to W. H. Beyer for oii 
of the theoretical developments and the computational work involved 
in getting the values $R(p, n_2, \gamma_{p,\alpha})$ of the studentized range required
for Tables II and III as explained above. 


7. REFERENCES 


(1) Bechhofer, Robert E., “A Single-Sample Multiple Decision Procedure for Rank-
ing Means of Normal Populations with Known Variances,” Annals of Mathe-
matical Statistics, 25, 16-39, 1954.

(2) Beyer, William H., “Certain Percentage Points of the Distribution of the
Studentized Range of Large Samples,” Virginia Polytechnic Institute M.S.
thesis, 56 pp., 1953.

(3) Cornfield, J., Halperin, M. and Greenhouse, S., “Simultaneous Tests of
Significance and Simultaneous Confidence Intervals for Comparisons of Many
Means,” unpublished mimeographed notes, National Institutes of Health,
Public Health Service, Department of Health, Education and Welfare,
Bethesda, Maryland, 18 pp., 1953.

(4) Davies, Owen L., “Statistical Methods in Research and Production,” 2nd ed., 
London, Oliver and Boyd, 1949. 

(5) Duncan, D. B., “Significance Tests for Differences between Ranked Variates
Drawn from Normal Populations,” Iowa State College Ph.D. thesis, 117 pp.,
1947.


(6) Duncan, D. B., “A Significance Test for Differences between Ranked
Treatments in an Analysis of Variance,” Virginia Journal of Science, 2,
171-189, 1951.

(7) Duncan, D. B., “On the Properties of the Multiple Comparisons Test,” Virginia
Journal of Science, 3, 49-67, 1952.

(8) Duncan, D. B., “Multiple Range Tests and the Multiple Comparisons Test,” 
(Preliminary Report), Biometrics, 9, Abstract 220, 1953. 

(9) Fisher, R. A., “The Design of Experiments,” six eds., London, Oliver and Boyd, 
1935-1951. 

(10) Keuls, M., “The Use of the ‘Studentized Range’ in Connection with an Analysis 
of Variance,” Euphytica, 1, 112-122, 1952. 

(11) Lehmann, E. L., “Some Principles of the Theory of Testing Hypotheses,” 
Annals of Mathematical Statistics, 21, 1-26, 1950. 

(12) Newman, D., “The Distribution of the Range in Samples from a Normal Popu- 
lation, Expressed in Terms of an Independent Estimate of Standard Deviation,” 
Biometrika, 31, 20-30, 1939. 

(13) Neyman, J., “First Course in Probability and Statistics,” Henry Holt and
Company, Inc., New York, 1950.

(14) Paterson, D. D., “Statistical Technique in Agricultural Research,”
McGraw-Hill Book Company, Inc., New York, 1939.

(15) Pearson, E. S., and Hartley, H. O., “The Probability Integral of the Range in 
Samples of n Observations from a Normal Population,” Biometrika, 32, 301-310, 
1942. 

(16) Pearson, E. S., and Hartley, H. O., “Tables of the Probability Integral of the 
‘Studentized’ Range,” Biometrika, 33, 89-99, 1948. 

(17) Roy, S. N. and Bose, R. C., “Simultaneous Confidence Interval Estimation,”
Annals of Mathematical Statistics, 24, 513-536, 1953. 

(18) Sawkins, D. T., unpublished work, University of Sydney, 1938. 

(19) Scheffé, H., “A Method for Judging All Contrasts in the Analysis of Variance,” 
Biometrika, 40, 87-104, 1953. 

(20) Somerville, P. N., “Some Problems of Optimum Sampling,” Biometrika, 41, 420- 
429, 1954. 

(21) Tukey, J. W., “Comparing Individual Means in the Analysis of Variance,” 
Biometrics, 5, 99-114, 1949. 

(22) Tukey, J. W., “Quick and Dirty Methods in Statistics,” part II, Simple Analyses 
for Standard Designs, Proceedings Fifth Annual Convention, American Society for 
Quality Control, 189-197, 1951. 

(23) Tukey, J. W., “The Problem of Multiple Comparisons,” unpublished dittoed 
notes, Princeton University, 396 pp., 1953. 

(24) Wald, A., “Contributions to the Theory of Statistical Estimation and Testing
Hypotheses,” Annals of Mathematical Statistics, 10, 299-326, 1939.

(25) Hartley, H. O., “Some Significance Test Procedures for Multiple Comparisons,” 
Annals of Mathematical Statistics, 25, Abstract 19, 1954. 


FURTHER CONTRIBUTIONS TO THE THEORY OF 
PAIRED COMPARISONS¹


M. G. KENDALL 
Visiting Professor, Institute of Statistics, North Carolina State College 


1. When a pair of objects is presented for comparison and the two 
are placed in the relationship preferred: not-preferred, we have what is 
known as a paired comparison. A set of n objects can be compared,
a pair at a time, in some or all of the possible n(n — 1)/2 ways of choosing 
a pair, and the set of paired comparisons so derived gives us a picture 
of the interrelationships of the objects under preference. A paired- 
comparison scheme is more general than a ranking; for with the latter 
A-preferred-to-B and B-preferred-to-C automatically ensures A-pre- 
ferred-to-C, whereas with paired comparisons it might happen that C 
was preferred to A. The existence of these departures from the ranking 
situation may be due to various reasons, such as the fact that ‘pre- 
ference’ is a complicated comparison being made with reference to 
several factors simultaneously; and one reason for using paired com- 
parisons is to give such effects a chance to show themselves. 

2. Situations often occur in which a set of m observers express 
preferences among n objects and we have to select that object, or perhaps 
that sub-set of objects, which are, in some sense, “most preferred.” 
The simplest case is the one where there are only two objects, A and B, 
and every observer votes for either A or B as president of an institution. 
If 51 per cent of the votes are cast for A and 49 per cent for B we declare 
A elected. In doing so we have satisfied 51 per cent of the preferences 
but have had to proceed contrary to 49 per cent; we may say that 
49 per cent of the preferences were violated. More generally, when we 
have to select a subset of the n objects as “elected” we shall in general, 
in the absence of complete unanimity, violate a number of preferences. 
Circumstances force us to do so to some extent. The problem is to do 
so to the least possible extent. 

¹This research was supported by the United States Air Force, through the Office of Scientific
Research of the Air Research and Development Command.


3. Consider the case in which 7 members of a body have to elect a
committee of three from among themselves. We will suppose that no
member votes for himself (though this makes no essential difference)
and that there are no abstentions (though this too makes no essential
difference). If the 7 members are represented by the letters A to G
they might vote as follows:


   Member        Members Preferred

      A                B D E
      B                C A F
      C                D G A
      D                C B E
      E                A B C
      F                A C D
      G                B A C                              (1)


Here, for the moment, we suppose that there is no preference expressed
among the triplets of members preferred; that is to say, A prefers
B, D, E but does not say whether B is preferred to D or E, or D to E.
He might then have written down his nominees in any order.

Under this system each elector expresses 9 preferences. A, for
example, says, in effect, that he prefers B to C, F, and G, prefers D to
C, F, and G, and prefers E to C, F, and G. There are thus 63 preferences
altogether. We will represent this scheme in a two-way array of the
following kind:


                                                          No. of
            A     B     C     D     E     F     G      preferences
   A        -     2     0     3     4     3     3          15
   B        1     -     1     2     1     4     3          12
   C        1     1     -     3     3     3     4          15
   D        0     2     1     -     2     2     2           9
   E        1     0     1     0     -     2     2           6
   F        0     0     0     1     1     -     1           3
   G        0     1     0     0     1     1     -           3

 Totals     3     6     3     9    12    15    15          63      (2)


Here, if A is preferred to B (a relationship we shall henceforward write
as A pref. B or A → B) we write a unit in the row A, column B. For
example C prefers D, G, A to each of B, E, F. We therefore have
units in row D, Col. B; row D, Col. E; row D, Col. F; row G, Col. B;
row G, Col. E; row G, Col. F; row A, Col. B; row A, Col. E; row A,
Col. F. The totality of preferences expressed in (1) is given in the
array (2), each cell showing the number of units entered, together with
row and column totals.

Notice that: (a) the sum of row and column totals for each letter 
is 18. This provides a check. The reason is that each of the letters is 
compared with three others by each of six observers, so that each letter 
has 18 preferences (one way or the other). 

(b) each column or row total is a multiple of three; for if any letter 
is preferred at all by an observer it is preferred to three others. 

4. From the array (2) we see that A and C had 15 preferences each. 
If all preferences expressed by all observers have equal weight there is 
nothing to choose between them. B comes next with 12 preferences. 
All the others have fewer. Thus, if we have to elect three out of the 
seven to form a committee, we elect A, B and C. 

5. The procedure we have followed exhibits the structure of the 
preference scheme most clearly, but for the purposes of electing a 
committee of three we can proceed much more expeditiously. In fact, 
from array (1) we see that the voting is as follows: 


   Member        Number of votes
      A                 5
      B                 4
      C                 5
      D                 3
      E                 2
      F                 1
      G                 1
                       21                                 (3)


A comparison of this with (2) shows that in the latter the row totals are 
thrice the number of votes. The reason is easy to see, for if any letter 
gets a vote it is thereby preferred to three others. 
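
As an illustration, the array (2) and the vote counts (3) can be rebuilt mechanically from the ballots of (1). The sketch below is an illustration under stated assumptions rather than part of the original analysis: the names `BALLOTS` and `preference_array` are introduced here, and B's ballot is read as C, A, F, the reading that agrees with the totals printed in (2) and (3).

```python
# Ballots of array (1): each member names three others (order ignored here).
# B's ballot is taken as C, A, F (assumption, see the note above).
BALLOTS = {
    "A": "BDE", "B": "CAF", "C": "DGA", "D": "CBE",
    "E": "ABC", "F": "ACD", "G": "BAC",
}
MEMBERS = sorted(BALLOTS)


def preference_array(ballots):
    """pref[x][y] = number of electors who name x and omit y."""
    members = sorted(ballots)
    pref = {x: {y: 0 for y in members} for x in members}
    for elector, named in ballots.items():
        # An elector is never compared with himself.
        omitted = [m for m in members if m != elector and m not in named]
        for x in named:
            for y in omitted:
                pref[x][y] += 1
    return pref


pref = preference_array(BALLOTS)
row_totals = {x: sum(pref[x].values()) for x in MEMBERS}
col_totals = {y: sum(pref[x][y] for x in MEMBERS) for y in MEMBERS}
votes = {m: sum(m in named for named in BALLOTS.values()) for m in MEMBERS}

# Elect the three members with most preferences (ties broken alphabetically).
elected = sorted(MEMBERS, key=lambda m: (-row_totals[m], m))[:3]
print(row_totals, votes, elected)
```

The row totals come out as three times the votes, and the row and column totals for each letter add to 18, which are the two checks noted in the text.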

6. Now let us suppose that the rules of election are altered slightly
and that each elector writes down the three members he prefers in
order of preference. Such an order might be that of array (1) where,
for example, A gives B his first preference, D his second and E his
third. Each elector now expresses 12 preferences, three among the
set he names and 9 by implication between those three and the three
he omits. If we now form an array of preferences we get, instead of (2)


            A     B     C     D     E     F     G       Totals
   A        -     3     3     4     4     4     3         21
   B        2     -     3     3     3     4     3         18
   C        2     2     -     4     4     4     4         20
   D        1     2     1     -     3     2     3         12
   E        1     0     1     0     -     2     2          6
   F        0     0     0     1     1     -     1          3
   G        1     1     0     0     1     1     -          4

 Totals     7     8     8    12    16    17    16         84      (4)


The antisymmetry of the table has now been lost and row or column
totals are no longer divisible by three. But we could still pick out the
three members with the greatest number of preferences (A, C, B as
before) without constructing a full table. In fact from (1) we score
for A the following preferences allotted by the electors B to G:

                 4 + 3 + 0 + 5 + 5 + 4 = 21

and so for the other letters. The scores are the preference totals in the
final column of (4).

7. The same method can obviously be applied to any number of
voters and any size of committee. Under the condition that there are
no abstentions and that nobody votes for himself, the total number of
preferences expressed by m voters for a committee of n (no preferences
between committee nominees) is mn(m − n − 1); or if preferences
are expressed by ranking nominees, is mn(m − n/2 − 3/2). We may
now, if we wish, relax some of the conditions without affecting essentials.

(a) If every man is allowed to vote for himself nothing new is
introduced so long as we adhere to the principle of giving each preference
the same weight;

(b) The same principles apply when a number of electors express
preferences concerning a group of individuals who are not themselves
electors. If m judges express preferences for k out of n objects
(without ordering them) the number of preferences is mk(n − k).

(c) If there are any abstentions we can continue as before to count
those preferences which are expressed. Suppose, for example, that
instead of (1) we had the following preferences expressed (second
column):


   Member        Preferences        Corrected Preferences

      A             BDE                     BDE
      B             CA                      CA
      C             DGAB                    DGA
      D             CBE                     CBE
      E             AB                      AB
      F             ACD                     ACD
      G             BBB                     B                    (5)


We suppose that these are in order. Member C has overstepped the
mark. Unless we reject his ballot as spoiled we delete B from his
ordering. Member B prefers C to A and both to the other four, but
cannot express a preference between those other four and hence submits
only two names. Member G tries to “plump” but we disallow this
and count his expression as a preference for B only. We now have the
preferences in the third column of (5) giving the following:


   Preferences
       for
        A           4 + 3 + 5 + 5        = 17
        B           5 + 4 + 4 + 5        = 18
        C           5 + 5 + 4            = 14
        D           4 + 5 + 3            = 12
        E           3 + 3                =  6
        F                                =  0
        G           4                    =  4
                                           71                    (6)


A, B and C are still elected but B now gains more preferences than A.

We notice that election on this principle maximizes the number of
satisfied preferences as before.

(d) If any voter “ties” certain nominees, this is equivalent to
expressing no preference between them and everything proceeds as
before. For example, if in (5) member D tied C, B, E there would be
two fewer preferences for C and one fewer preference for B in (6).

(e) In particular this method covers the case when each of a set of
judges ranks all the objects, and not merely a preferred sub-set of them.
The whole method, in fact, is very flexible in this respect. So long as
any preferences are expressed we can pursue the same technique. The
only thing to take particular care about is that one judge has the same
opportunities as another for expressing the same number of preferences,
even though he may not avail himself of them. We clearly introduce
bias if we give one judge a chance to express two preferences and another
only one. The system proposed is in accordance with the best democratic
principles in that each judge has the same number of votes,
and all votes have the same weight.

(f) It is possible to order the members, according to the number of 
preferences allotted to them, in a ranking (which may itself contain 
tied members). Thus we constrain a paired-comparison system into 
a ranking at the expense of violating a number of preferences. The 
fewer the violations the nearer the scheme to an actual ranking. In 
tables of the type of (2) or (4) a perfect and unanimous ranking would 
correspond to a situation in which all the non-zero cells were above the 
main diagonal. 

(g) In those cases where we choose to regard any object as compared
with itself, as for example if we wish to complete the diagonals in (2),
we may allot ½ to the cell in the same row and column. This will clearly
not affect the order of the objects according to numbers of preferences
received, for each object then receives an extra ½ for each observer.

(h) Likewise, if an observer cannot express a preference between a
given pair A, B we may allot ½ to each of the cells in row A, column B
and row B, column A in arrays of type (2).

(i) We can, if we wish, give effect to differences in reliability between 
judges. For example, if in array (2) we regard D as twice as important 
in his preferences as the others, we enter 2 for each preference instead 
of unity in the table. 
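
The counting rules of paragraphs 6 and 7 can be checked with a short sketch. It is an illustration only, under these assumptions: ballots are already corrected as in the third column of (5), contain no self-votes and no repeated names, and B's ballot in (1) is read as C, A, F. The function names are introduced here and are not the author's.

```python
MEMBERS = "ABCDEFG"


def scores(ballots):
    """Preferences gained by each member from ordered (possibly short) ballots."""
    total = {m: 0 for m in MEMBERS}
    for elector, ranked in ballots.items():
        others = [m for m in MEMBERS if m != elector]
        unnamed = len(others) - len(ranked)      # members not named at all
        for pos, name in enumerate(ranked):
            later = len(ranked) - pos - 1        # named after him on the ballot
            total[name] += later + unnamed
    return total


ARRAY_1 = {"A": "BDE", "B": "CAF", "C": "DGA", "D": "CBE",
           "E": "ABC", "F": "ACD", "G": "BAC"}
ARRAY_5 = {"A": "BDE", "B": "CA", "C": "DGA", "D": "CBE",
           "E": "AB", "F": "ACD", "G": "B"}


def count_unordered(m, n):
    """mn(m - n - 1): m voters, committee of n, nominees not ranked."""
    return m * n * (m - n - 1)


def count_ranked(m, n):
    """mn(m - n/2 - 3/2), written with integers only."""
    return m * n * (2 * m - n - 3) // 2


full = scores(ARRAY_1)       # final column of (4)
partial = scores(ARRAY_5)    # totals of (6)
print(full, partial, count_unordered(7, 3), count_ranked(7, 3))
```

The two formulas give 63 and 84 for seven voters and a committee of three, in agreement with the grand totals of (2) and (4).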

8. Finally, let us note that the array of preferences can be used to
calculate a coefficient of agreement among judges. This is another
aspect of the coefficient of agreement in paired comparisons proposed
by Babington Smith and myself some years ago. (See my Advanced
Theory of Statistics, vol. 1, chapter 16). In fact if the total possible
number of agreements is N and the actual number of agreements is M,
the coefficient of agreement would be simply 2M/N − 1 which varies
from −1/m or −1/(m − 1) to 1. In table (2) for example the cells
(A, B) and (B, A) have respectively 2 and 1 members. The pair A, B
are compared three times and of these comparisons two are in agreement;
there is thus one agreement out of a possible 3; likewise for AG,
there are three agreements, each in the cell AG, out of a possible 3.
For the whole table it will be found that there are 47 agreements out
of a possible 74 and the coefficient of agreement is 0.270.
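
The count of agreements can be reproduced as follows. This is a sketch under one assumption about the counting rule, inferred from the two worked cells above: a cell holding a units contributes a(a − 1)/2 agreements, and a pair compared a + b times has (a + b)(a + b − 1)/2 possible agreements. The matrix `PREF` is array (2) written with numerals.

```python
from math import comb

# Array (2): PREF[i][j] = number of preferences for member i over member j.
PREF = [
    [0, 2, 0, 3, 4, 3, 3],  # A
    [1, 0, 1, 2, 1, 4, 3],  # B
    [1, 1, 0, 3, 3, 3, 4],  # C
    [0, 2, 1, 0, 2, 2, 2],  # D
    [1, 0, 1, 0, 0, 2, 2],  # E
    [0, 0, 0, 1, 1, 0, 1],  # F
    [0, 1, 0, 0, 1, 1, 0],  # G
]


def agreement(pref):
    """Return (M, N, 2M/N - 1) summed over all pairs of objects."""
    n = len(pref)
    agree = possible = 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = pref[i][j], pref[j][i]
            agree += comb(a, 2) + comb(b, 2)   # agreeing pairs of judgments
            possible += comb(a + b, 2)         # all pairs of judgments
    return agree, possible, 2 * agree / possible - 1


M, N, u = agreement(PREF)
print(M, N, round(u, 3))
```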


9. We may also use the table to calculate a coefficient of departure
from the ranking situation. Suppose we arrange the table so that rows 
and columns follow the order of the number of preferences expressed; 
in the case of table (2) this merely amounts to interchanging the rows 
and columns corresponding to B and C. The number of units below 
the diagonal is then 13 and that above the diagonal is 50. No other 
arrangement of rows and columns can divide the 63 preferences so 
unequally. If all were above the diagonal the preferences would be 
consistent with a ranking. We might then take as our measurement of 
departure from the ranking situation the coefficient (13/63) × 2 =
0.413. We have multiplied the factor 13/63 by two because the furthest 
situation from ranking occurs when one half of the total preferences 
are allotted to the cells below the diagonal. 
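
The coefficient of departure can be computed in the same spirit. The sketch below assumes array (2) written with numerals and uses the ordering A, C, B, D, E, F, G described in the text; the function name is introduced here for illustration.

```python
LABELS = "ABCDEFG"
PREF = [
    [0, 2, 0, 3, 4, 3, 3],  # A
    [1, 0, 1, 2, 1, 4, 3],  # B
    [1, 1, 0, 3, 3, 3, 4],  # C
    [0, 2, 1, 0, 2, 2, 2],  # D
    [1, 0, 1, 0, 0, 2, 2],  # E
    [0, 0, 0, 1, 1, 0, 1],  # F
    [0, 1, 0, 0, 1, 1, 0],  # G
]


def departure_from_ranking(pref, order):
    """Units below and above the diagonal, and twice the share below."""
    below = above = 0
    for r, i in enumerate(order):
        for c, j in enumerate(order):
            if r > c:
                below += pref[i][j]
            elif c > r:
                above += pref[i][j]
    return below, above, 2 * below / (below + above)


# Rows and columns for B and C interchanged, as described in the text.
order = [LABELS.index(x) for x in "ACBDEFG"]
below, above, coeff = departure_from_ranking(PREF, order)
print(below, above, round(coeff, 3))
```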

10. So much for the elements of the subject. I now proceed to 
consider sundry developments which are necessary to enable a more 
penetrating study of a paired-comparison situation to be made. The 
first arises from the nature of paired comparisons in themselves and 
may best be introduced by an example. 

Let us suppose that six players A to F are engaged in a chess tournament
in which each plays the other once. The set of scores (1 for a
win, ½ for a draw and 0 for a loss) then represents a set of paired
comparisons made in all possible ways between them. We assume that all
games reach a decision so that there are no missing values. A possible
set of results is as follows:


                                                        Total
            A     B     C     D     E     F             score
   A        ½     1     1     0     1     1              4½
   B        0     ½     0     1     1     0              2½
   C        0     1     ½     1     1     1              4½
   D        1     0     0     ½     0     0              1½
   E        0     0     0     1     ½     1              2½
   F        0     1     0     1     0     ½              2½       (7)


The simple way of arranging the competitors in order of success is to
add up their scores, as is done in the final column. If we had three
prizes we should divide the first and second between A and C and
divide the third among B, E and F. Only D does not qualify for a
share of the prize money. Such a procedure would be adopted in most
tournaments of the kind.

11. But we now notice one rather anomalous effect. D, the only
player to receive nothing, has in fact beaten one of the winners, A.
We are not allowed to dismiss this as a mere fluke, because all preferences
are equally valid. Furthermore A has beaten C but is nevertheless
ranked with him. Vague but genuine feelings for general equity lead
us to inquire whether something should not and cannot be done to
restore the balance. Such a method was suggested by Dr. T. H. Wei
(1952) in an unpublished thesis successfully submitted to the University
of Cambridge for the Ph.D. degree. In effect Wei’s procedure amounts
to this:

We recalculate a score for each player by giving him the score of 
every player he has beaten and half the score of every player with 
whom he has drawn. This leads to the following new scores: 


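A = ½(4½) + 2½    + 4½    + 0     + 2½    + 2½    = 14¼
B = 0     + ½(2½) + 0     + 1½    + 2½    + 0     =  5¼
C = 0     + 2½    + ½(4½) + 1½    + 2½    + 2½    = 11¼
D = 4½    + 0     + 0     + ½(1½) + 0     + 0     =  5¼
E = 0     + 0     + 0     + 1½    + ½(2½) + 2½    =  5¼
F = 0     + 2½    + 0     + 1½    + 0     + ½(2½) =  5¼      (8)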


We now arrange the players in order of new scores; and we now notice 
that A and C have separated, A being first and C second, while D has 
moved up to equality with B, E, and F. 

This is as far as one would wish to go on practical grounds, perhaps, 
but now a further point raises itself. We have re-allocated the scores 
once. Why not do so again? If we re-allocate the scores of (8) by the 
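same method we find


A = ½(14¼) + 5¼    + 11¼    + 0     + 5¼    + 5¼    = 34⅛
B = 0      + ½(5¼) + 0      + 5¼    + 5¼    + 0     = 13⅛
C = 0      + 5¼    + ½(11¼) + 5¼    + 5¼    + 5¼    = 26⅝
D = 14¼    + 0     + 0      + ½(5¼) + 0     + 0     = 16⅞
E = 0      + 0     + 0      + 5¼    + ½(5¼) + 5¼    = 13⅛
F = 0      + 5¼    + 0      + 5¼    + 0     + ½(5¼) = 13⅛      (9)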


A and C are still first and second but D takes third place and B, E, F 
share the fourth position. 


If we re-allocate the scores once more we find scores
A 83.0625
B 36.5625
C 69.5625
D 42.5625
E 36.5625
F 36.5625 (10)


The order is now the same as we derived from (9); and if we ascertain
new scores on the same principle we shall find that no new ordering has
appeared. Later I shall prove that after a time the situation always
“settles down” in this way.
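
The successive re-allocations can be reproduced exactly with rational arithmetic. The following is a minimal sketch, assuming the tournament results of (7) with ½ on the diagonal; the names `P` and `reallocate` are introduced here for illustration and are not Wei's notation.

```python
from fractions import Fraction

H = Fraction(1, 2)
# Tournament (7): P[i][j] = 1 if player i beat player j, 1/2 on the diagonal.
P = [
    [H, 1, 1, 0, 1, 1],  # A
    [0, H, 0, 1, 1, 0],  # B
    [0, 1, H, 1, 1, 1],  # C
    [1, 0, 0, H, 0, 0],  # D
    [0, 0, 0, 1, H, 1],  # E
    [0, 1, 0, 1, 0, H],  # F
]


def reallocate(p, scores):
    """Give each player the scores of those he beat, and half his own."""
    return [sum(pij * s for pij, s in zip(row, scores)) for row in p]


first = [sum(row) for row in P]      # simple totals of (7)
second = reallocate(P, first)        # scores (8)
third = reallocate(P, second)        # scores (9)
fourth = reallocate(P, third)        # scores (10)
print([float(x) for x in fourth])
```

On this calculation the fourth set of scores is 83.0625 for A, 69.5625 for C, 42.5625 for D and 36.5625 for each of B, E and F, so the order is the same as at the third stage.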


13. There are two interesting features of this procedure. Let us 
revert to the preference scheme of (7) and regard the scores as a matrix. 
If we square this matrix we obtain 


            A     B     C     D     E     F        Row totals
   A        ¼     3     1     4     3     3           14¼
   B        1     ¼     0     2     1     1            5¼
   C        1     2     ¼     4     2     2           11¼
   D        1     1     1     ¼     1     1            5¼
   E        1     1     0     2     ¼     1            5¼
   F        1     1     0     2     1     ¼            5¼      (11)


and the row totals are those previously obtained in (8) by the first 
re-allocation of scores. The reason for this will be obvious to anyone 
familiar with the rules of matrix multiplication and the result is gener- 
ally true for all preference matrices. Furthermore, if we multiply (11) 
again by the matrix (7) and add row totals we shall get the scores of (9); 
and so on. The continual re-allocation of scores is equivalent to taking 
successive powers of the matrix. 
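
The equivalence between re-allocation and powering can be verified directly. This sketch assumes the matrix of (7) with ½ on the diagonal; `matmul` is a plain helper written for the purpose, not a library routine.

```python
from fractions import Fraction

H = Fraction(1, 2)
P = [
    [H, 1, 1, 0, 1, 1],  # A
    [0, H, 0, 1, 1, 0],  # B
    [0, 1, H, 1, 1, 1],  # C
    [1, 0, 0, H, 0, 0],  # D
    [0, 0, 0, 1, H, 1],  # E
    [0, 1, 0, 1, 0, H],  # F
]


def matmul(a, b):
    """Ordinary matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


P2 = matmul(P, P)                       # the squared matrix (11)
P3 = matmul(P2, P)
row_totals_2 = [sum(r) for r in P2]     # should equal the scores (8)
row_totals_3 = [sum(r) for r in P3]     # should equal the scores (9)
print(row_totals_2, row_totals_3)
```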

14. Let us now consider what interpretation can be given to the 
process in terms of comparisons. The following diagram shows the 
scheme of (7) in geometrical form. The six players are represented by 
the six vertices of a regular hexagon, which are joined by straight lines 
in all possible ways. If A pref. B we draw an arrow from A towards B. 
If no preference was expressed (or the game was drawn) we do not draw 
an arrow. 


                              FIGURE 1                              (12)


15. It will be seen that the score of any player in (7) is the number
of arrows leaving his vertex, together with ½ (as the conventional score
in the diagonal, when he is compared with himself) and ½ for any line
passing through his vertex on which no arrow is drawn. When we
proceed to the next stage we count the number of paths leaving the
vertex and taking two steps. For example, for A we have the following
paths leaving A and also leaving the vertex next visited:


ABD, ABE; ACB, ACD, ACE, ACF; AED, AEF; AFB, AFD.


There are ten of these ‘transitive’ preferences. We also count the
preference of B with itself, C with itself, etc., as ½ each, making a
further score of 2; the four paths AAB, AAC, AAE and AAF, in which A is
first taken with itself, likewise score ½ each, a further 2; and finally
we score ½ of ½ for the double preference of A with itself. The total
score is 14¼, which is the score for A in (11).
It may be verified that the same procedure gives the other scores in
that array.

Similarly the scores obtained by the next re-allocation, as given 
in (9), are the numbers of paths of three lines leaving the respective 
vertices, all arrows going the same way, with similar conventions about 
vertices taken with themselves; and so on. Our re-allocation is equivalent 
to powering the matrix or to counting paths of transitive preferences of 
increasing extent. 
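
The path-counting interpretation can be illustrated by enumerating the paths themselves. The sketch assumes the arrows of the diagram are those implied by (7), and that a step from a vertex to itself counts ½; `BEATS`, `transitive_paths` and `score` are names introduced here.

```python
from fractions import Fraction

# BEATS[x] lists the players to whom an arrow runs from x, read from (7).
BEATS = {
    "A": "BCEF", "B": "DE", "C": "BDEF",
    "D": "A", "E": "DF", "F": "BD",
}


def transitive_paths(start, steps):
    """All paths of `steps` arrows leaving `start`, arrows all the same way."""
    paths = [start]
    for _ in range(steps):
        paths = [p + nxt for p in paths for nxt in BEATS[p[-1]]]
    return paths


def score(start, steps):
    """Weighted path count: 1 per arrow followed, 1/2 per step taken in place."""
    if steps == 0:
        return Fraction(1)
    stay = Fraction(1, 2) * score(start, steps - 1)
    return stay + sum(score(nxt, steps - 1) for nxt in BEATS[start])


two_step = transitive_paths("A", 2)
print(two_step, score("A", 2), score("A", 3))
```

The ten two-step paths from A are recovered, and the weighted counts for two and three steps agree with A's scores in (8) and (9).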

16. From the geometrical viewpoint it is seen that in proceeding by 
re-allocation we are extending our concept of comparison. We began 
by considering comparisons of pairs by themselves. When we proceed 
to the next stage we compare pairs which form part of triads; but we 
do not compare the triads by considering them as three pairs (which 
would bring us back to the first situation). Thus it is possible to
“compare” A and C by the route A → B → C or A and B by A → C → B.
Both of these “comparisons” do not count in our score because they 
cannot both happen together; but either counts when it occurs. 

17. Or we may put it another way by saying that we compare two 
members AB not directly, but through their comparisons with other 
members, e.g. by ACB, ADB, AEB and AFB. We choose the leading 
members in the final order so as to maximize the agreement with tran- 
sitive preferences. Whether this is the right thing to do depends 
to some extent on practical circumstances. The process of continual 
re-allocation has the advantage that it results in an objective final 
ordering; but whether this is what we want depends on whether we 
are considering a situation in which direct comparison is the basic 
generator of the data, or whether we wish to give scope for more reflec- 
tive judgment in roundabout comparisons involving other members. 

18. Let us now consider the case when several judges make paired 
comparisons, or several tournaments are played between the same set 
of players. For each observer we shall have a preference matrix of 
the type of (7). To obtain a composite picture, on the supposition 
that the judges are equally reliable, we superpose the matrices. Thus
if (7) represents the preferences of a judge for 6 varieties of ice cream
when offered to him in pairs, two additional judges might have the
following preference matrices:


            A     B     C     D     E     F        Totals
   A        ½     1     0     1     1     0          3½
   B        0     ½     ½     1     0     1          3
   C        1     ½     ½     1     0     1          4
   D        0     0     0     ½     1     ½          2
   E        0     1     1     0     ½     1          3½
   F        1     0     0     ½     0     ½          2        (13)

            A     B     C     D     E     F        Totals
   A        ½     0     1     1     ½     1          4
   B        1     ½     1     0     0     1          3½
   C        0     0     ½     1     1     1          3½
   D        0     1     0     ½     ½     ½          2½
   E        ½     1     0     ½     ½     1          3½
   F        0     0     0     ½     0     ½          1        (14)


Adding these and (7) together we get 


            A     B     C     D     E     F        Totals
   A       1½     2     2     2    2½     2          12
   B        1    1½    1½     2     1     2           9
   C        1    1½    1½     3     2     3          12
   D        1     1     0    1½    1½     1           6
   E        ½     2     1    1½    1½     3           9½
   F        1     1     0     2     0    1½           5½

 Totals     6     9     6    12    8½   12½          54       (15)


On the basis of simple paired comparisons we should place A and C as
bracketed equal, E as third, B as fourth, D as fifth and F as last.

19. The question now arises whether we should re-allocate the scores 
by powering the matrix (15); or whether it would be preferable to power 
each matrix and then amalgamate the rankings at the end. The two 
processes will not always lead to identical results, although in practice 
they should not differ very much. Arithmetically it is simpler to power 
just the one matrix (15), and in cases where there are many judges this
would be almost decisive. This is the procedure I would recommend 
myself, but if there were any serious doubts I would perform the analysis 
both ways and compare the results. A wide disparity would, in my 
view, suggest that neither was very reliable. It would arise mostly in 
cases where there were substantial disagreements among judges. 
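
The superposition and the powering of the combined matrix can be sketched as below. The three matrices are those of (7), (13) and (14) as read here, with ½ on each diagonal; `add` and `power_scores` are illustrative helpers, and the second-stage comparison between A and C is a by-product of this sketch rather than a result quoted from the text.

```python
from fractions import Fraction

H = Fraction(1, 2)
M7 = [[H, 1, 1, 0, 1, 1], [0, H, 0, 1, 1, 0], [0, 1, H, 1, 1, 1],
      [1, 0, 0, H, 0, 0], [0, 0, 0, 1, H, 1], [0, 1, 0, 1, 0, H]]
M13 = [[H, 1, 0, 1, 1, 0], [0, H, H, 1, 0, 1], [1, H, H, 1, 0, 1],
       [0, 0, 0, H, 1, H], [0, 1, 1, 0, H, 1], [1, 0, 0, H, 0, H]]
M14 = [[H, 0, 1, 1, H, 1], [1, H, 1, 0, 0, 1], [0, 0, H, 1, 1, 1],
       [0, 1, 0, H, H, H], [H, 1, 0, H, H, 1], [0, 0, 0, H, 0, H]]


def add(*mats):
    """Cell-by-cell sum of equally sized square matrices."""
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) for j in range(n)] for i in range(n)]


def power_scores(p, k):
    """Row totals of the k-th power of p, found by k re-allocations."""
    s = [Fraction(1)] * len(p)
    for _ in range(k):
        s = [sum(a * b for a, b in zip(row, s)) for row in p]
    return s


combined = add(M7, M13, M14)            # array (15)
totals = power_scores(combined, 1)      # its row totals
second_stage = power_scores(combined, 2)
print(totals, second_stage)
```

In this sketch the tie between A and C in the simple totals is broken in favour of A at the second stage.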

20. I now prove that the process of repeated powering does in fact 
converge to a limiting ranking. Dr. Wei offered a proof of the result for 
one observer and a complete set of preferences in his thesis. 

First of all we define a matrix A of non-negative elements to be
indivisible if it cannot be expressed in the form (by rearrangement of
rows and columns)

$$\begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix} \qquad (16)$$

If a preference matrix of type (15) is divisible in this sense the members 
of one block of objects are always preferred to every member of another. 
In such a case we divide the data into the two blocks and operate on 
each, finally ranking the members of the first group and then the mem- 
bers of the second. Similarly, if one of these blocks is itself divisible 
we divide it up; and so on. We clearly lose no generality by doing this, 
and divisibility is not a handicap in our preference situations. 


21. I now require a theorem of Frobenius (cf. Wielandt, 1950²)
which says that for indivisible matrices A with non-negative elements
and positive elements in the diagonal there exists a unique simple
positive root of the equation $|A - \lambda I| = 0$ which is greater than all
other roots in absolute value; and that the corresponding characteristic
vector has all its elements of the same sign (which we may take to be
positive).

Let $\lambda_1$ be this largest root and $Y_1$ the corresponding vector. Then
if $\lambda_2, \cdots, \lambda_n$ are the other roots and $Y_2, \cdots, Y_n$ the
corresponding vectors, and if P be the preference matrix, we have

$$PY = \Lambda Y \qquad (17)$$

where $\Lambda$ is the diagonal matrix

$$\Lambda = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix} \qquad (18)$$

It is now easy to show that for any positive integer k

$$P^k Y = \Lambda^k Y \qquad (19)$$

As the powering proceeds the major root $\lambda_1$ becomes dominant and
(19) tends to the equation

$$P^k Y_1 = \lambda_1^k Y_1 \qquad (20)$$

Thus from some k onwards the final ordering will be determined by the
vector $Y_1$, which has non-negative elements.
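
The convergence can be watched numerically. The sketch below applies repeated re-allocation to the tournament matrix of (7), rescaling at each step so that the scores sum to one; this rescaling is an assumption made here for numerical convenience and does not alter the ordering. The figure quoted for the largest root is an approximate value from this sketch, not one given in the paper.

```python
# Tournament (7) with 1/2 on the diagonal, in floating point.
P = [
    [0.5, 1, 1, 0, 1, 1],  # A
    [0, 0.5, 0, 1, 1, 0],  # B
    [0, 1, 0.5, 1, 1, 1],  # C
    [1, 0, 0, 0.5, 0, 0],  # D
    [0, 0, 0, 1, 0.5, 1],  # E
    [0, 1, 0, 1, 0, 0.5],  # F
]


def step(p, s):
    """One re-allocation of the scores s."""
    return [sum(a * b for a, b in zip(row, s)) for row in p]


def limiting_scores(p, steps=60):
    """Repeated re-allocation, rescaled each time so the scores sum to 1."""
    s = [1.0] * len(p)
    for _ in range(steps):
        s = step(p, s)
        total = sum(s)
        s = [x / total for x in s]
    return s


limit = limiting_scores(P)
# Since the limit sums to 1, one more step estimates the largest root.
largest_root = sum(step(P, limit))
print([round(x, 4) for x in limit], round(largest_root, 4))
```

The limiting vector keeps the order A, C, D, then B, E and F together, and the estimate of the largest root is about 2.61.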

22. We notice that the proof remains applicable to preference 
matrices in which some preferences may be missing, or when ties are 
present, provided that the matrix is not divisible. If any cell in a 
combined preference matrix contains no entries we insert a zero. 

23. It is also of some interest to note that we may prove that the 
preference matrix is never singular. In fact, we can always express it 
(apart from positive numerical factors) in the form 


(Q + U)                                                             (21)

where Q is an anti-symmetric matrix and U is the matrix all of whose
elements are unity. For example (15), after division of rows by 1½,
can be expressed as U plus the matrix


     0      1/3     1/3     1/3     2/3     1/3
   -1/3      0       0      1/3    -1/3     1/3
   -1/3      0       0       1      1/3      1
   -1/3    -1/3     -1       0       0     -1/3
   -2/3     1/3    -1/3      0       0       1
   -1/3    -1/3     -1      1/3     -1       0


²I am indebted to Professor A. C. Aitken and Dr. F. G. Foster for some references on this subject.
The preference matrices are similar to, but not identical with, the matrices of transition probabilities
studied in the theory of stationary stochastic processes.


We reduce Q + U systematically by subtracting the first column 
from the second column, then the first row from the second row; then 
the first column from the third column, then the first row from the 
third row; and so on. The effect on Q is to reduce it to another anti- 
symmetric matrix, say Q’, and the effect on U is to reduce it to a unit 
in the top left-hand corner and zero elsewhere. Thus the determinant 
of Q + U is the determinant of Q’ plus the determinant of the principal 
minor obtained by omitting the first row and column, which is also 
antisymmetric. 

Now the determinant of a p × p antisymmetric matrix is zero if p is
odd and positive if p is even. Hence the determinant of Q + U is the 
sum of two components, one zero and the other positive; and hence it 
does not vanish. 
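
For the combined matrix (15) the decomposition and the non-vanishing determinant can be checked exactly. This is a sketch under the assumption that (15) is as read here (entries stored doubled to keep them integral); `det` is a plain Gaussian elimination over the rationals written for this illustration.

```python
from fractions import Fraction

# Array (15) with every entry doubled, so that all values are integers.
TWICE_15 = [
    [3, 4, 4, 4, 5, 4],  # A
    [2, 3, 3, 4, 2, 4],  # B
    [2, 3, 3, 6, 4, 6],  # C
    [2, 2, 0, 3, 3, 2],  # D
    [1, 4, 2, 3, 3, 6],  # E
    [2, 2, 0, 4, 0, 3],  # F
]
P15 = [[Fraction(x, 2) for x in row] for row in TWICE_15]
# Dividing (15) by 1 1/2 and subtracting U leaves Q: x/1.5 - 1 = (2x)/3 - 1.
Q = [[Fraction(x, 3) - 1 for x in row] for row in TWICE_15]
Q_PLUS_U = [[q + 1 for q in row] for row in Q]


def det(matrix):
    """Exact determinant by Gaussian elimination with row interchanges."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n, sign, result = len(a), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result


antisymmetric = all(Q[i][j] == -Q[j][i] for i in range(6) for j in range(6))
print(antisymmetric, det(Q), det(Q_PLUS_U), det(P15))
```

Here the number of objects is even, so the determinant of Q + U coincides with that of Q, and neither vanishes.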

22. In practice the number of paired comparisons arising from n 
objects may be inconveniently large and the question arises whether it 
is possible to economize in the number of comparisons made. In the 
example of the chess tournament which has been mentioned above 
(paragraph 10) if each player is to play every other, 15 games must be 
played. But only three can be conducted at once, so at best 5 sessions 
are necessary. If this is too long, and, say, three sessions are all that 
can be allowed, only nine games can be played and six have to be 
sacrificed. The question is, which six? Or again, if an individual is 
comparing items by taste testing, his patience or his palate may not 
endure the presentation of all the possible pairs, and a problem arises 
as to how best to cut down the number of pairs and which pairs to 
present. 

23. Problems like this arise in many fields of experimentation and are 
usually dealt with by incomplete balanced blocks. Some new points, 
however, arise in paired-comparison work. Durbin (1951) has considered
the use of Youden designs in ranking experiments. More recently 
Benard and van Elteren (1953) have discussed tests of significance 
where incomplete rankings are concerned. Without trying to exhaust 
the subject I proceed to consider the use of incomplete balanced blocks 
in preference schemes. 

24. Consider first of all the case of a single observer. Of the 
n(n — 1)/2 preferences which he could make we require to pick out a 
sub-set. Certain elementary principles of choice at once suggest 
themselves: 

(a) every object should appear equally often. In this sense the 
design should be balanced; 

(b) the preferences should not be divisible in the sense that we 
can split the objects into two sets and no comparison is made between 
any object in one and any object in the other. 

In terms of preference matrices (a) means that there should be the
same number of non-empty items in each row and column; (b) means
that the matrix does not divide into two blocks and become of the form
$\begin{pmatrix} X & 0 \\ 0 & Y \end{pmatrix}$ where the zeros represent empty cells. In terms of the
preference diagram (a) means that there are the same number of paths direct
between points leaving or entering each vertex and (b) means that the
figure does not separate into two distinct polygons.

25. When possible I add a further condition of symmetry to the 
situation, that is to say 

(c) In the preference diagram the number of paths of length l
proceeding from any point to any other point shall be the same for all
pairs of points.

The length l here means the number of lines traversed in the path,
e.g. the path (in Figure 1, section 14) ABC from A to C is of length 2
and AEBDC from A to C is of length 4. Where a pair of objects is not
compared in these “partial” situations we omit the line between them.
If they are joined by a line without an arrow this means that they have 
been compared but that no preference has been expressed. 

In terms of preference matrices this condition implies a kind of 
symmetry of interlocking. A path ABC implies entries in row A, 
column B and column C (and the reflections column A, row B and row 
C); and analogous entries must occur in other rows in such a way that 
all the objects are symmetrically involved. 

26. Under these conditions we can meet a requirement suggested 
to me in conversation by Dr. R. C. Bose: if all the preferences are 
exerted at random (e.g. if we toss up for it which of a pair shall be 
preferred) all possible final orderings of the objects produced by powering 
the matrix should be equally probable. This follows from the symmetry
of the situation, for we can interchange two objects in the designs
without altering the preference matrix, so far as concerns the underlying 
probabilities, and all final orders are therefore equally probable. 

27. In a sense, it seems to me, condition (c) is necessary as well as 
sufficient for a proper design. If it is not obeyed certain objects become 
subject to different schemes of preference from others and their final 
positions are not determined on an unbiased basis. In terms of powered 
preference matrices, the sums of rows are not based on the same number 
of transitive comparisons of length l.

28. The conditions laid down above impose certain limitations on the
scope of a paired-comparison experiment. For instance, if there are six 
objects and the numbers of entries in the rows of the preference matrix 
are equal, the number of comparisons necessary to obtain a balanced 
experiment must be a multiple of three. Anything else destroys the 
balance. The connectivity condition (b) further limits the freedom of 
choice; for example, with six objects at least six comparisons are required. 
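
The arithmetic behind these limitations is short: if each of n objects enters r comparisons, the total number of comparisons is nr/2. The sketch below is an illustration under two assumptions that are not stated in this form in the text: no pair is compared more than once, and connectivity is represented only by the necessary condition of at least n - 1 comparisons. The function names are hypothetical.

```python
def balanced_comparison_counts(n_objects):
    """Totals n*r/2 for every common number r of comparisons per object (1 <= r <= n-1)."""
    totals = []
    for r in range(1, n_objects):
        if (n_objects * r) % 2 == 0:  # n*r must be even, each comparison involves two objects
            totals.append(n_objects * r // 2)
    return totals


def connected_balanced_counts(n_objects):
    """Balanced totals that also meet the necessary condition of n-1 comparisons for connectivity."""
    return [t for t in balanced_comparison_counts(n_objects) if t >= n_objects - 1]
```

For six objects the balanced totals are 3, 6, 9, 12 and 15, and the smallest that can also be connected is 6; for seven objects the totals are 7, 14 and 21.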

29. The setting up of incomplete designs is most easily thought of in
terms of tours round the preference polygon. Consider the case n = 7.
(Prime numbers are easier to deal with in most experimental designs.)
There are 21 comparisons altogether. To obtain a balanced design
we must have either 7 or 14 comparisons (or, of course, the full 21).
The first 7 may, without loss of generality, be taken as the tour
ABC ... G round the preference heptagon. (No generality is lost
because each member must be connected to two others and hence
they are on a chain which may be taken to be the order A to G.) For
the next 7 we have two possibilities: (a) start from A, miss a vertex
and go to C, miss a vertex and go to E and so on; (b) start from A,
miss two vertices and go to D, then two vertices and go to G and so on.
We do not obtain new designs by tours missing three or more vertices
because they are equivalent to (a) or (b). The two schemes are shown
in Figure 3.

FIGURE 2


FIGURE 3


These schemes are not identical. In the former there are two 
triangular tours connecting any pair, e.g. ACB and AGB, whereas in the 
second there is only one, e.g. AEB. In terms of time taken in perform- 
ance there is nothing to choose between them. For example if they 
represented a chess tournament, each round requires three games, 
one player having a bye, and for 14 games 5 rounds are required. Such 
might be 


        Scheme 1        Bye             Scheme 2        Bye
        AB, CD, EF      G               AB, CD, EF      G
        AC, BD, EG      F               AD, BC, FG      E
        BC, DE, FG      A               BE, CF, DG      A
        AF, CE, BG      D               AE, BF, CG      D
        AG, DF          B, C, E         AG, DE          B, C, F         (23)
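
The two schemes, and the triangle counts quoted for the pair A, B, can be checked directly. This is a sketch under the assumption that Scheme 1 consists of the comparisons between vertices one and two places apart on the heptagon, and Scheme 2 of those one and three places apart; the helper names are hypothetical.

```python
LABELS = "ABCDEFG"


def circulant(steps, n=7):
    """Comparisons joining vertex k to vertex k+s (mod n) for each step s."""
    return {frozenset((k, (k + s) % n)) for s in steps for k in range(n)}


def triangles_through(pair, edges, n=7):
    """Third vertices c such that both members of the pair are compared with c."""
    a, b = pair
    return [c for c in range(n)
            if c not in pair
            and frozenset((a, c)) in edges
            and frozenset((b, c)) in edges]


def parse_rounds(rounds):
    """Turn rounds such as 'AB CD EF' into a set of unordered comparisons."""
    return {frozenset(LABELS.index(ch) for ch in game)
            for rnd in rounds for game in rnd.split()}


scheme1 = circulant([1, 2])
scheme2 = circulant([1, 3])
rounds1 = ["AB CD EF", "AC BD EG", "BC DE FG", "AF CE BG", "AG DF"]
rounds2 = ["AB CD EF", "AD BC FG", "BE CF DG", "AE BF CG", "AG DE"]

via1 = [LABELS[c] for c in triangles_through((0, 1), scheme1)]  # triangles on A, B in Scheme 1
via2 = [LABELS[c] for c in triangles_through((0, 1), scheme2)]  # triangles on A, B in Scheme 2
```

The rounds of (23) reproduce each scheme exactly, and the pair A, B is joined through C and G in the first scheme but only through E in the second.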


30. It remains to be considered whether one scheme is preferable 
to the other by some other criterion. There is nothing to choose between 
them in relation to balance or the application of the powered-matrix 
method. We note, however, that the patterns of transitive preferences 
are different. In the first any pair is connected by two triangles, three
quadrilaterals, etc., in the second by one triangle, four quadrilaterals,
etc. On the whole, I should be inclined to select the second design 
from a feeling that it has higher connectivity, but an exact criterion 
awaits further investigation. 

31. When we have several judges, an obvious extension of symmetry 
requirements necessitates that each participates to an equivalent extent: 
in some sense the design should be balanced by judges as well as by 
comparisons. Something depends on whether we require to compare 
judges in addition to objects. If so, each pair of judges must have
certain comparisons in common. With two judges and seven objects,
for example, one simple way would be to allot to each 14 comparisons,
one judging according to each of the designs of Figure 3. They would 
then have 7 comparisons in common and all possible comparisons 
could be made. 

32. I do not propose on this occasion to attempt a systematic 
exposition of the design problems involved in paired comparisons. 
Designs of an optimum kind which balance by numbers of comparisons, 
objects compared, numbers of observers on given comparisons and so 
forth are probably rather rare; and if something has to be sacrificed 
it depends on what is the point of major interest whether we sacrifice 
symmetry in comparisons or in judges. A final example will make 
clear a few of the principles involved. 

Consider again the case of seven objects, ABCDEFG. There are 
three distinct tours round the preference polygon, 


        A B C D E F G
        A C E G B D F
        A D G C F B E          (24)


Each tour involves seven comparisons and each object is compared 
with two others in a tour. 
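
The distinct tours can be generated by stepping round the polygon with a constant step. The sketch below assumes that n is an odd prime and that the tours intended are those with steps 1, 2, ..., (n - 1)/2; the function names are hypothetical.

```python
def tours(n):
    """Closed tours through objects 0..n-1 with constant step s = 1..(n-1)/2 (n an odd prime)."""
    return [[(k * s) % n for k in range(n)] for s in range(1, (n - 1) // 2 + 1)]


def tour_comparisons(tour):
    """The comparisons made along a closed tour, as unordered pairs."""
    n = len(tour)
    return {frozenset((tour[k], tour[(k + 1) % n])) for k in range(n)}


def as_letters(tour):
    return "".join("ABCDEFGHIJK"[v] for v in tour)


seven = [as_letters(t) for t in tours(7)]
all_pairs_7 = set().union(*(tour_comparisons(t) for t in tours(7)))
all_pairs_11 = set().union(*(tour_comparisons(t) for t in tours(11)))
```

For seven objects this gives three tours of seven comparisons each, which together cover all 21 comparisons once; for eleven objects it gives five tours covering all 55.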

For a complete set of comparisons each observer would have to 
make 21. If this is felt to be too much we may allocate 14, consisting 
of two tours each. And if the tours are represented by a, b, c, we may 
allocate to the observers 1, 2, 3 


        1: a, b
        2: b, c
        3: a, c          (25)


With these schemes every comparison is made equally often (twice); 
every tour is made equally often (twice); every observer makes the
same number of comparisons (14); every observer has a tour in common
with every other observer; and thus every observer can be compared 
with every other observer in respect of two comparisons involving any 
specified object. 

If we have more than three observers, we take a number equal to a 
multiple of three and replicate the design. 

Now suppose we had eleven objects, A to K. The full set of compari- 
sons numbers 55. There are five distinct tours round the preference 
polygon 


        a: A B C D E F G H I J K
        b: A C E G I K B D F H J
        c: A D G J B E H K C F I
        d: A E I B F J C G K D H
        e: A F K E J D I C H B G          (26)


Now if we try to allot two tours to each of five observers we lose sym- 
metry; for there are 10 pairs of tours choosable from these five. We 
have, to preserve complete balance, to allot four tours to each observer 
1, 2, 3, 4, 5 


        1: a, b, c, d
        2: a, b, c, e
        3: a, b, d, e
        4: a, c, d, e
        5: b, c, d, e          (27)


Again the tours are balanced, but we have not achieved very much. 
Each observer now makes 44 comparisons, against the full set of 55. 

We can sacrifice symmetry in several ways. We may, for instance, 
allot two tours to each observer, e.g. 


        1: a, b
        2: b, c
        3: c, d
        4: d, e
        5: a, e          (28)

Here every observer can be compared with two other observers but not 
every pair can be compared. Or if we have, say, 10 observers we may 
allot all the 10 possible pairs of tours one to each. Each observer then 
makes 22 comparisons and can be compared with four other observers. 
If 22 comparisons are still felt to be too many for one observer we may 
allocate the 55 preferences according to a linked design, e.g. (numbering 
the preferences 1 to 55) with 11 observers, 10 preferences each 


         1:  1,  2,  3,  4,  5,  6,  7,  8,  9, 10
         2:  1, 11, 12, 13, 14, 15, 16, 17, 18, 19
         3:  2, 11, 20, 21, 22, 23, 24, 25, 26, 27
         4:  3, 12, 20, 28, 29, 30, 31, 32, 33, 34
         5:  4, 13, 21, 28, 35, 36, 37, 38, 39, 40
         6:  5, 14, 22, 29, 35, 41, 42, 43, 44, 45
         7:  6, 15, 23, 30, 36, 41, 46, 47, 48, 49
         8:  7, 16, 24, 31, 37, 42, 46, 50, 51, 52
         9:  8, 17, 25, 32, 38, 43, 47, 50, 53, 54
        10:  9, 18, 26, 33, 39, 44, 48, 51, 53, 55
        11: 10, 19, 27, 34, 40, 45, 49, 52, 54, 55          (29)


Here we have cut down the comparisons for each observer to 10 and 
each comparison is made twice. But we have lost a good deal of the 
comparison between judges; every judge can be compared with every 
other judge but only on one comparison of objects. 
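
A linked design of this kind can be generated by associating each preference with a pair of observers. The sketch below assumes that the 55 preferences are numbered in the order of the pairs (1, 2), (1, 3), ..., (10, 11), which is the numbering that reproduces the surviving rows of (29); the function name `linked_design` is hypothetical.

```python
from itertools import combinations


def linked_design(n_observers):
    """Number the pairs of observers 1, 2, ... and give each preference to both members of its pair."""
    observers = range(1, n_observers + 1)
    allocation = {obs: [] for obs in observers}
    for number, (a, b) in enumerate(combinations(observers, 2), start=1):
        allocation[a].append(number)
        allocation[b].append(number)
    return allocation


design = linked_design(11)

# Number of preferences shared by each pair of observers.
shared = {(a, b): len(set(design[a]) & set(design[b]))
          for a, b in combinations(design, 2)}
```

Each observer receives 10 preferences, each preference is allotted twice, and every pair of observers has exactly one preference in common.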


REFERENCES 


Benard, A. and Van Elteren, Ph. (1953), A generalization of the method of m
rankings. Kon. Neder. Ak. van Wetenschappen. A, 56, 358.

Durbin, J. (1951), Incomplete blocks in ranking experiments. Brit. Jour. Psych.
4, 85.

Frobenius, G. (1912), Über Matrizen aus nicht negativen Elementen. Sitz. Preuss.
Akad. Wiss., 456.

Wei, T. H. (1952), The algebraic foundations of ranking theory. Unpublished
thesis, Cambridge University, England.

Wielandt, H. (1950), Unzerlegbare, nicht negative Matrizen. Math. Zeit., 52, 642. 


COMPARATIVE SENSITIVITY OF PAIR AND TRIAD 
FLAVOR INTENSITY DIFFERENCE TESTS¹


J. W. HOPKINS AND N. T. GRIDGEMAN


Division of Applied Biology, National Research Council, 
Ottawa, Canada 


INTRODUCTION 


Alternative simple experimental designs for sensory difference tests 
of flavor intensity lead to the procedures termed “pair”, “duo-trio”
and “triangular” tests (3). In the first, a unit trial consists in
submitting coded aliquots of the two batches in question to a subject in
the sequence (X), (Y) or (Y), (X) and requiring him to rank them in
the order of appraised flavor strength. In the second, it consists in
submitting identified X or Y with the coded sequence (X), (Y) or
(Y), (X) and requiring the subject to attempt to match the identified
with the like coded aliquot. In the third, it consists in submitting
one of the completely coded sequences (X), (X), (Y); (X), (Y), (X);
(Y), (X), (X); (Y), (Y), (X); (Y), (X), (Y) or (X), (Y), (Y) and again
requiring an attempted matching of like aliquots. Inferences respect- 
ing the occurrence or non-occurrence of real discrimination are then 
made by relating the actual frequency of ranking or matching in re- 
peated trials to percentage points of the binomial distribution expected 
in the absence of discrimination. 

It has been suggested (4) that the “triangular” test is “obviously 
the most efficient” but experimental evidence to the contrary has been 
reported (1). This note indicates some statistical considerations relevant 
to efficiency comparisons, and applies them to additional data. 


STATISTICAL CONSIDERATIONS 


At a nominal significance level of 5% for a “pair” test n-replicated,
the appropriate critical region for rejection of the null hypothesis that
sensorily X = Y will comprise all x at or below the effective 2.5% and
at or above the effective 97.5% points of the cumulative binomial
distribution (6, 7) of x for n and $p_0 = 1/2$. Here $p_0$ is the chance
probability on the null hypothesis and x the recorded frequency of a
specified ranking, e.g. of (Y) above (X). For the “duo-trio” and
“triangular” tests, which involve only matching without ranking, the
corresponding critical regions will comprise all points x at or above the
effective 95% points of the cumulative binomial distribution of x for n
and $p_0 = 1/2$ and $p_0 = 1/3$ respectively. Here $p_0$ is the probability
on the null hypothesis and x the recorded frequency of matching like
aliquots.

¹N. R. C. Publication No. 3529.

FIG. 1. Power of “pair”, “duo-trio” and “triangular” flavor intensity
appraisals for detection of Y > X when nominal significance level
$\alpha = 0.05$ in relation to the probability $p_1$ of genuine sensory
discrimination: (A) for equal numbers of replicates n = 21 and n = 99;
(B) for equal numbers of aliquots N = 42 and N = 198.

In the presence of a marginal difference having a constant probability
$p_1$ of sensory recognition, the probability p of ranking (Y) above (X)
in a “pair” test will be the sum of $p_1$ and of the conditional probability
of chance guessing after failure to discriminate. Hence $p = p_1 + (1 - p_1)/2$
or $p = 1 - p_1 - (1 - p_1)/2$, i.e. $p = (1 \pm p_1)/2$, according
as the intensity of Y > X or of Y < X. For “duo-trio” and “triangular”
tests the probability of matching like aliquots will now correspondingly
be $p = (1 + p_1)/2$ and $p = (1 + 2p_1)/3$ respectively. Fig. 1A depicts
the resulting power at nominal $\alpha = 0.05$ of these three tests of X ≠ Y,
i.e. the probability $1 - \beta$ that x will fall in the critical regions specified,
as $p_1$ ranges from 0 to 1 when n = 21 and when n = 99. For
corresponding $0 < p_1 < 1$, the power order is “pair” < “duo-trio” <
“triangular”. However for equal numbers N of appraised aliquots
(Fig. 1B) the order is “duo-trio” ≶ “pair” < “triangular”, the power
of “pair” relative to “duo-trio” tests varying with N and $p_1$ partly
because of differences between the nominal and effective percentage
points of discrete distributions. Unfortunately, in practice $p_1$ is
unspecifiable a priori. An assumption of equal $p_1$ in all three types of
test for an identical flavor contrast or the consequences of inequalities
in $p_1$ must be tested experimentally in specific instances.
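
The power comparison described above can be sketched with exact binomial sums. The code below is an illustration only and rests on an assumption of its own: the critical region is taken as the largest tail whose probability does not exceed the nominal level (2.5% in each tail for “pair”, 5% in the upper tail for the matching tests). The authors' “effective” percentage points may have been chosen differently, so the figures need not reproduce the published curves or tables. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)


def upper_tail(c, n, p):
    """P(X >= c) for X binomial (n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(c, n + 1))


def upper_critical(n, p0, alpha):
    """Smallest c with P(X >= c | p0) <= alpha (assumed conservative rule)."""
    for c in range(n + 2):
        if upper_tail(c, n, p0) <= alpha:
            return c


def power(test, n, p1, alpha=0.05):
    """Probability that x falls in the critical region when discrimination occurs with probability p1."""
    if test == "pair":
        p = (1 + p1) / 2
        c = upper_critical(n, 0.5, alpha / 2)  # two-sided: alpha/2 in each tail
        lower = sum(binom_pmf(k, n, p) for k in range(0, n - c + 1))
        return upper_tail(c, n, p) + lower
    if test == "duo-trio":
        return upper_tail(upper_critical(n, 0.5, alpha), n, (1 + p1) / 2)
    if test == "triangular":
        return upper_tail(upper_critical(n, 1 / 3, alpha), n, (1 + 2 * p1) / 3)
    raise ValueError(test)
```

With n = 21 and $p_1 = 0.5$ this gives the order “pair” < “duo-trio” < “triangular” stated in the text for equal numbers of replicates.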




EXPERIMENTAL DATA 


Three parallel trials were made in the writers’ laboratory. Slight 
modifications of the flavor intensity of an aqueous solution generating 
a mixture of the four primary tastes (Trial A), of tomato juice (Trial 
B) and of minced steak (Trial C) provided three flavor contrasts of 
differing complexity. These were each appraised under standardized 
conditions by six experienced subjects in 18 replicate “pair”, “duo-trio”
and “triangular” discrimination tests, of which there were thus 972 in 
all. Sequences of presentation of the various coded aliquot pairs and 
triads to each subject were randomized subject to the condition that 
each of the two possible coded pair sequences occurred equally fre- 
quently, likewise each of the four “duo-trio” sequences X, (X), (Y); 
X, (Y), (X); Y, (X), (Y) and Y, (Y), (X), and likewise each of the 
six triad sequences enumerated above. 

Table I summarizes the results obtained. In all nine tests the 
recorded total frequency of specified rankings or matchings exceeded its 
no-discrimination expectation of 54 for the “pair” and “duo-trio” and 
of 36 for the “triangular” tests. 


TABLE I.
Recorded Frequency of Specified Ranking and Matching of Aliquots in (1) “Pair”, (2) “Duo-Trio” and
(3) “Triangular” Flavor Tests


Trial                       Subject
and                                                   Total
test        I     II    III     IV      V     VI        x
A.1         9     15      9     14     13     15       75
A.2        14     11      9      8     13     10       65
A.3         5      8     12      8      8      9       50
B.1        10     11      9     10     14      7       61
B.2         7      9     10     12     11      9       58
B.3         5      5      9      6      6      7       38
C.1        12     12     14     13     12      9       72
C.2        10      9     10     11     13     13       66
C.3         6      8     12     10      8      9       53


ANALYSIS OF DATA 
Intra-test homogeneity 


Calculated indices of dispersion (Cochran’s (2) Q), appropriate to
repetitive data for the same individuals (5), were entirely consistent
with inter-replicate stability of sensory discrimination by the group of
subjects as a whole. Moreover, the individual frequencies of specified
rankings and matchings listed in each row of Table I, when arrayed
together with their complements in nine 2 × 6 contingency tables,
gave an aggregate index of inter-subject intra-test homogeneity of
$\chi^2 = 45.9$ with 9 × 5 = 45 d.f. In this instance therefore it is also
reasonable to assume that p was sensibly the same for all six subjects
in any one test and trial.
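
The aggregate homogeneity index can be recomputed from Table I. The sketch below assumes the ordinary Pearson statistic for each 2 × 6 table, with 18 replicates per subject. Two entries (B.2, subject I, and B.3, subject VI) are illegible in the source and are taken here as 7, the value forced by the row totals. The function name is hypothetical.

```python
TABLE_I = {
    "A.1": [9, 15, 9, 14, 13, 15],
    "A.2": [14, 11, 9, 8, 13, 10],
    "A.3": [5, 8, 12, 8, 8, 9],
    "B.1": [10, 11, 9, 10, 14, 7],
    "B.2": [7, 9, 10, 12, 11, 9],   # first entry inferred from the row total of 58
    "B.3": [5, 5, 9, 6, 6, 7],      # last entry inferred from the row total of 38
    "C.1": [12, 12, 14, 13, 12, 9],
    "C.2": [10, 9, 10, 11, 13, 13],
    "C.3": [6, 8, 12, 10, 8, 9],
}
REPLICATES = 18  # replicate tests per subject in each trial and test


def homogeneity_chi2(successes, n_each):
    """Pearson chi-square for a 2 x k table of successes and their complements."""
    p = sum(successes) / (len(successes) * n_each)
    chi2 = 0.0
    for x in successes:
        for observed, expected in ((x, n_each * p), (n_each - x, n_each * (1 - p))):
            chi2 += (observed - expected) ** 2 / expected
    return chi2


aggregate = sum(homogeneity_chi2(row, REPLICATES) for row in TABLE_I.values())
degrees_of_freedom = len(TABLE_I) * (6 - 1)
```

Under these assumptions the aggregate agrees with the quoted value of 45.9 on 45 degrees of freedom.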


Inter-test differences 
The logarithm of the likelihood of any recorded x for “pair” and
“duo-trio” tests will be:

$$\log L = x \log\left(\frac{1 + p_1}{2}\right) + (n - x) \log\left(\frac{1 - p_1}{2}\right),$$

while for “triangular” tests

$$\log L = x \log\left(\frac{1 + 2p_1}{3}\right) + (n - x) \log\left(\frac{2 - 2p_1}{3}\right).$$

Hence maximum likelihood estimates $\hat{p}_1$ of $p_1 > 0$, specified by equating
$\partial \log L/\partial p_1$ to zero, will result from $(2x - n)/n$ for “pair” and
“duo-trio” and from $(3x - n)/2n$ for “triangular” tests. “Pair” and
“duo-trio” tests in which x < n/2 and “triangular” tests in which x < n/3
provide no internal evidence of $p_1 > 0$. As $\hat{p}_1$ is a linear function of
$\hat{p} = x/n$, the random sampling variance of the former will be
$V(\hat{p}_1) = 4V(\hat{p})$ for “pair” and “duo-trio” and $9V(\hat{p})/4$ for “triangular” tests;
and with n = 108, $V(\hat{p})$ may be estimated with reasonable confidence
from $\hat{p}(1 - \hat{p})/n$. From the marginal totals of Table I, the following
$\hat{p}_1$ result.


                          Test
Trial       “Pair”    “Duo-trio”    “Triangular”
A            .39         .20            .19
B            .13         .07            .03
C            .33         .22            .24
Average      .28         .16            .15
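
These estimates follow directly from the marginal totals of Table I. The sketch below applies the estimators $(2x - n)/n$ and $(3x - n)/2n$ with n = 108; the function and variable names are hypothetical, and truncation at zero is an assumption reflecting the remark that smaller x give no evidence of $p_1 > 0$.

```python
N = 108  # 6 subjects x 18 replicates
TOTALS = {  # trial: (pair, duo-trio, triangular) totals from Table I
    "A": (75, 65, 50),
    "B": (61, 58, 38),
    "C": (72, 66, 53),
}
TESTS = ("pair", "duo-trio", "triangular")


def p1_hat(x, n, test):
    """Maximum likelihood estimate of p1, truncated at zero."""
    if test == "triangular":
        estimate = (3 * x - n) / (2 * n)
    else:
        estimate = (2 * x - n) / n
    return max(estimate, 0.0)


estimates = {trial: tuple(p1_hat(x, N, t) for x, t in zip(xs, TESTS))
             for trial, xs in TOTALS.items()}

pair_mean = sum(e[0] for e in estimates.values()) / 3
three_aliquot_mean = sum(e[1] + e[2] for e in estimates.values()) / 6
difference = pair_mean - three_aliquot_mean
```

Rounded to two places the estimates agree with the tabulated values for the three trials, and the difference between the “pair” mean and the mean of the two 3-aliquot tests comes to about 0.124.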


The difference of 0.125 between the mean $\hat{p}_1$ for the “pair” and that for
both the 3-aliquot tests is 1.97 times its estimated standard deviation
of 0.0635. The mean $\hat{p}_1$ for the “duo-trio” and “triangular” tests
evidently do not differ significantly.


FIG. 2. Increment $\Delta p_1$ in the probability $p_1$ of genuine sensory
discrimination required to equalize the power of “pair” and “triangular”
flavor intensity appraisals, in relation to “triangular” $p_1$: (A) for equal
numbers of replicates n = 21 and n = 99; (B) for equal numbers of
aliquots N = 42 and N = 198.


Power Effects 


Discriminatory powers for X ≠ Y specifically attained in these
experiments cannot be estimated with exactitude, because of the
imprecision in $\hat{p} = x/n$ with n no larger than 108. However, since all
three trials were consistent with identical $p_1$ for “duo-trio” and
“triangular” appraisals, the relative sensitivity of these may be inferred
from Fig. 1 and the preceding $\hat{p}_1$. For $p_1$ the same as the listed $\hat{p}_1$ the
comparative powers $1 - \beta$ for detecting Y > X with equal numbers n
of replicates and N of aliquots, and nominal significance level $\alpha = 0.05$,
would be:


Trial     “Pair”    “Triangular”   “Triangular”     “Pair”    “Triangular”   “Triangular”
          n = 21      n = 21         n = 14         n = 99      n = 99         n = 66
          N = 42      N = 63         N = 42         N = 198     N = 297        N = 198
A          .40         .29                           1.00        .79            .67
B          .08         .02                           0.16        .08            .09
C          .30         .38            .19            1.00        .92            .82


Fig. 2 illustrates the increment $\Delta p_1$ required for equipotency of “pair”
and “triangular” tests as a function of “triangular” $p_1$ in the equal
replicate and equal aliquot instances exemplified above. Ordinates
of the curves depicted in this figure correspond to abscissal distances
between power curves in Fig. 1.


DISCUSSION 


For corresponding $p_1$, “triangular” tests have a statistical advantage
over “duo-trios” and “pairs”, both per replicate and per aliquot. However,
the preceding experimental results, together with those of Byer
and Abrams (1), suggest that in some instances at least $p_1$ may in
fact be greater in “pair” appraisals, possibly because fewer
intercomparisons are required. The data also suggest that such
discriminatory superiority may sometimes more than offset the statistical
advantage per aliquot of “triangles”.

Man-hours devoted to flavor appraisals are not all spent in actual 
tasting. Appreciable aggregate amounts of time may also be required 
to schedule, assemble, instruct and return subjects to their own working 
quarters. These are largely independent of the number of aliquots 
appraised per session. When the latter is small therefore the relative 
power per man-hour of “pair” and “triangular” tests may be inter- 
mediate between their relative powers per replicate and per aliquot. 
In large-scale testing, and whenever test materials are scarce or costly, 
relative power: cost ratios will approximate more closely to the latter. 
Also, matching tests may be applicable to the detection of qualitative 
differences for which ranking is inappropriate. Factors such as these, 
as well as purely statistical considerations, may accordingly influence 
a rational choice between pair and triad tests for specific applications. 


ACKNOWLEDGMENTS 


The writers are indebted to Mrs. G. E. Tyler for technical supervision 
of tasting experiments, and to the six participants for sustained conscien- 
tious collaboration. 


REFERENCES 


1. Byer, A. J. and Abrams, D. A comparison of the triangular and two-sample 
taste-test methods. Food Technology, 7: 185-187, 1953. 

2. Cochran, W. G. The comparison of percentages in matched samples. Biometrika,
37: 256-266, 1950.

3. Dawson, E. H. and Harris, B. L. Sensory methods for measuring differences in
food quality. Agric. Information Bull. 34, U.S. Dept. of Agriculture, Washington,
D. C., 1951.

4. Harrison, S. and Elder, L. W. Some applications of statistics to laboratory taste-
testing. Food Technology, 4: 434-439, 1950.

5. Hopkins, J. W. Some observations on sensitivity and repeatability of triad taste
difference tests. Biometrics, 10: 521-531, 1954.

6. National Bureau of Standards. Applied Mathematics Series 6, Tables of the
Binomial Probability Distribution. U.S. Gov’t. Printing Office, Washington,
D. C., 1950.

7. Romig, H. G. 50-100 Binomial Tables. John Wiley and Sons, Inc., New York, 
1953. 




THE DESCRIPTION OF GENIC INTERACTIONS IN 
CONTINUOUS VARIATION 


B. I. HAYMAN AND K. MATHER


A.R.C.’s Unit of Biometrical Genetics, 
Department of Genetics, University of Birmingham 


The genetical interpretation of the continuous variation (or indeed 
any variation) shown by a population, family or group of families 
requires the use of specifications of two distinct kinds. Firstly, it is 
necessary to specify the genetical structure of the population, family 
or families. In principle, this requires the specification in suitable 
terms of the relative frequencies of the various alleles of the genes 
involved, the distribution of the alleles at a locus between the various 
possible homozygotes and heterozygotes and the distribution of the 
alleles of different genes in respect of one another. These specifications 
will depend on the ancestry of the material, the mating system which 
has been in force, the selection which has been practised (if any), and
the linkage or other relation of the genes in transmission from parent 
to offspring. Specification of the genotype of every individual, or 
indeed of any individual, is not essential for most biometrical purposes 
so long as the relative frequencies of the different possible genotypes 
can be given, and indeed it is sufficient for many purposes to specify 
only the average, taken over all genes, of the allele frequencies, homozy- 
gosis, linkage relations and so on. 

Secondly, it is necessary to specify the relations between genotype 
and phenotype. In principle, this requires specification of the effect 
of each gene substitution on the character or characters in question, 
the dominance relations of the genes, the relations in effect of non-allelic 
genes (genic interaction), the effects of non-heritable agencies, and the 
relations in effect of genic and non-heritable agencies (genotype-environ- 
ment interaction). Specification of the effects of heritable but extra- 
nuclear particles may also be required in special cases, but experience 
shows that these may generally be neglected. Like the genetical 
specifications, the specifications of effect need not, for most biometrical 
purposes, be individually detailed. It will suffice for many purposes to 
specify only the effects of gene substitution, dominance and genic 
interaction, each pooled over all genes, and the pooled effects of all 
non-heritable agencies and their interactions, so that neither the 
individual genes need be isolated nor the non-heritable agencies 
separated. 


Given these two sets of specifications, the phenotypic properties 
(which are the properties capable of being observed) of the population 
can be predicted prior to observation; individually for each member 
of the population if the specifications are individually detailed, or 
statistically for the whole population if, as is usually the case, the 
specifications are statistical. We are, however, more commonly con- 
cerned with deriving the specifications from the observed properties 
of the material. This is clearly impossible without some knowledge 
of the genetical relations or breeding behaviour of the individuals 
whose phenotypes are observed. Generally it has been found convenient 
to set up the experiments in such a way that certain genetical specifi- 
cations can be reasonably assumed. Thus if we start with a cross 
between two true-breeding strains of plants and proceed thereafter by 
self-pollination, all precautions being taken to avoid selection, we may 
assume that only two alleles are present at any locus and that the rise 
of homozygosis will follow the rule Mendel first enunciated, both within 
and between the lineages derived at various stages of the experiment. 
Linkage relations remain to be inferred from the observations them- 
selves, and if the possibility of selection has not been eliminated, it 
may be no easy matter to distinguish between the effects of linkage 
and selection (Bateman and Mather 1951). Other systems of mating— 
sib-mating, backcrossing, diallel crossing and so on—may be, and 
have been, used for the same purpose; and inversions may be used so to 
reduce recombination within chromosomes that the linkage relations 
are simplified to an extent where they may be reasonably assumed. 
The basic principle of the approach remains the same in all these cases, 
and it depends for its success on the demonstration that all but a 
negligible fraction of the heritable component of continuous variation 
springs from nuclear genes, whose behaviour in transmission is under- 
stood from other types of genetical investigation. In this way the 
genetical study of continuous variation rests on the foundation provided 
by mendelian genetics in all its complexity and strength. Where 
determination is extra-nuclear, the genetical specification alters, and 
becomes less certain. It may indeed then cease to be a matter for 
confident assumption and become one for investigation and inference. 
The specification of effect is seldom if ever capable of the same 
precise assumption, for the reason that no generalisations of a precision 
and breadth of application comparable to those of the chromosome 
theory of heredity are available in respect of gene action. True, we 
are guided to the extent that we must bargain for genic interactions, 
both allelic in the form of dominance and non-allelic in the form of 
epistasis, and also genotype-environment interactions. In this way
we are told the broad classes into which the variation, or rather its
causation, must be partitioned; but we do not know in detail what to 
expect. Many types of interaction may exist side by side and we have 
no means of anticipating any one type or any mixture of types. Specifi- 
cation of effect is thus one of our regular and prime tasks of inference in 
interpreting continuous variation. One general tool we do have, 
however. The specification of effect will vary with the scale on which 
the character is measured. It is therefore assumed that this scale 
has been chosen to minimise the various types of interaction. Tests 
are available of the validity of this assumption (Mather 1949a). There 
can be no certainty, however, that a scale exists on which all inter- 
actions will vanish, and indeed we have evidence in particular cases 
that while scaling may reduce, it cannot wholly remove, the interactions 
that are present. Any comprehensive consideration of the specification 
of effect must therefore take into account interactions between non- 
allelic genes. In attempting such consideration our first task is clearly 
that of arriving at a suitable way of describing and classifying the 
interactions. It is with this first aspect of the problem that we are 
concerned in the present account. 


The Description of Interactions 


In diploid organisms the individual can fall into any one of three 
genetic classes (AA, Aa and aa) in respect of a gene for which there exist 
two alleles (A and a).* Two independent comparisons are possible 
among three classes. The effect of the gene difference on the phenotype 
can thus be described completely by two parameters, and specified 
completely if the values of these two parameters are known. Statisti- 
cally the pair of parameters may be defined in a variety of ways, but 
these will not all be of equal value in genetical analysis. In the system
adopted by Fisher et al. (1932) and Mather (1949 a and b), one parameter
(d) is used to represent the phenotypic difference between the two 
homozygotes, AA and aa, and the other (h) to represent the departure 
in phenotype of the heterozygote, Aa, from the mid-point between 
AA and aa. Taking this mid-point as the origin, the effects on the 
phenotype are then 


        aa      Aa      AA
        −d      h       d


so that the gene’s contribution to the fixable genetic variation is
proportional to d, while h reflects the dominance properties of the gene and
represents the contribution to the unfixable heritable variation. At
the same time, the contributions of d and h to the heritable variation
will be statistically independent so long as the two homozygotes are
equally frequent in the population or families. When this condition
is not fulfilled, their contributions to the variation will be partly
confounded.

*A is used to denote the allele tending to increase the manifestation of the character, and not, as is
conventionally the case, to denote the dominant allele. The direction of dominance is indicated by the
sign of the parameter h.
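
As a small worked illustration of the single-gene description, the two parameters can be recovered from the three class phenotypes. The sketch assumes, consistently with the effects −d, h, d about the mid-point, that d is half the difference between the homozygotes; the function name and the numerical phenotypes are hypothetical.

```python
def d_and_h(aa, Aa, AA):
    """Mid-point, d and h from the phenotypes of the three genotypes at one locus."""
    mid = (AA + aa) / 2   # origin: mid-point between the two homozygotes
    d = (AA - aa) / 2     # AA lies d above the mid-point, aa lies d below it
    h = Aa - mid          # departure of the heterozygote from the mid-point
    return mid, d, h


# Hypothetical phenotypes: aa = 10, Aa = 16, AA = 20.
mid, d, h = d_and_h(10, 16, 20)
```

Here the mid-point is 15, d = 5 and h = 1; the positive sign of h indicates dominance in the direction of the increasing allele A.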


With two gene differences, nine genotypes* are possible, and eight
parameters must be used to give a complete description of the
phenotypes. Four of these will be the d’s and h’s appropriate to the two
genes, as shown in the margins of Table 1. The other four may then
be derived conveniently to correspond to the “interaction” comparisons
of an analysis of variance where the d’s and h’s correspond to the “main
effects”. The distribution of these four parameters among the nine
genotypes is shown in Table 1. They fall into three classes. One of
these, $i_{ab}$, is the interaction of $d_a$ and $d_b$ and may be termed the
homozygote-homozygote interaction. Two others, $j_{ab}$ and $j_{ba}$, are the
homozygote-heterozygote interactions, respectively, of $d_a$ and $h_b$, and
$d_b$ and $h_a$. The last, $l_{ab}$, is the heterozygote-heterozygote interaction
of $h_a$ and $h_b$. The coefficients of ½ and ¼ are applied to the j’s and l
respectively so that equal contributions will be made to the overall
differences in an $F_2$ family by interactions of unit size. The double
frequency of heterozygotes in an $F_2$ also makes it unnecessary to vary the
coefficients of j and l from cell to cell of the table.


TABLE 1

                  AA                         Aa                     aa
                  d_a                        h_a                    −d_a

BB   d_b     d_a + d_b + i_ab           h_a + d_b              −d_a + d_b − i_ab
             − ½j_ab − ½j_ba            + ½j_ba                + ½j_ab − ½j_ba
             + ¼l_ab                    − ¼l_ab                + ¼l_ab

Bb   h_b     d_a + h_b                  h_a + h_b              −d_a + h_b
             + ½j_ab                                           − ½j_ab
             − ¼l_ab                    + ¼l_ab                − ¼l_ab

bb   −d_b    d_a − d_b − i_ab           h_a − d_b              −d_a − d_b + i_ab
             − ½j_ab + ½j_ba            − ½j_ba                + ½j_ab + ½j_ba
             + ¼l_ab                    − ¼l_ab                + ¼l_ab


The phenotypes associated with the nine genotypes in respect of two interacting genes.


*With two linked genes, there are ten genotypes because the double heterozygotes fall into the two
classes AB/ab and Ab/aB. Generally, however, these genotypes give a common phenotype so that the
distinction of linkage phase need be pursued no further in our present discussion.

The four interactions, as defined in this way, have clear genetical
meanings, though they do not follow the conventional genetical
classification of interaction between non-allelic genes. All the classical types
of interaction may, however, be cast in terms of i, j and l. The standard
mendelian $F_2$ segregation into four phenotypic classes with frequencies
9:3:3:1 occurs when $d_a = h_a$, $d_b = h_b$, and $i_{ab} = j_{ab} = j_{ba} = l_{ab}$.
Thus although this type of $F_2$ is classically regarded as showing no
interaction of the genes, interactions may be present within certain
restrictions. If we add the further condition that $d_a = \tfrac{3}{2} i_{ab}$ we obtain
the 9:3:4 ratio characteristic of recessive epistasis. The further
condition that $d_a = d_b$ then gives the 9:7 ratio of complementary genes.

Going back to the standard $F_2$, the additional condition $d_a = -\tfrac{1}{2} i_{ab}$
gives the 12:3:1 ratio of dominant epistasis; while the addition of the
still further condition $d_a = d_b$ gives the 15:1 ratio of duplicate factors.
Again, if instead of this last condition we put $d_a = -\tfrac{1}{3} d_b$, the 13:3
ratio of the recessive suppressor relation is obtained. Indeed any
interaction of two genes can be achieved by imposing appropriate
conditions. For example, a situation which might be described as
dominant dominance modification results from putting $d_a = 2h_a =
2d_b = 2h_b = j_{ba} = l_{ab}$ and $i_{ab} = j_{ab} = 0$. These various relations
are shown diagrammatically in Table 2.

The representations of all the classical types of interaction in terms
of the same parameters, $i$, $j$ and $l$, permit their combination in analysis,
so that it becomes possible to consider any number of genes with many
diverse interactions between them without any further elaboration.
Each will contribute in its own way to the $i$, $j$ and $l$ components of
variation and so long as we can discover how these components change
from generation to generation we can give an average, or statistical,
account of the interactions and their effects on variation without having
to aim at any individual classification. Furthermore, as we shall see




TABLE 2

The classical $F_2$ and six types of classical digenic interaction in terms of $d$, $h$, $i$, $j$ and $l$

  Interaction            Relations among the parameters
  Classical F2           $d_a = h_a$, $d_b = h_b$, $i_{ab} = j_{a|b} = j_{b|a} = l_{ab}$
  Dom. Dom. Modifier     $d_a = 2h_a = 2d_b = 2h_b = \ldots$ [remaining relations illegible in the scan]
  Dom. Epistasis         Classical F2 with $d_b = -\tfrac{1}{2} i$
  Rec. Epistasis         Classical F2 with $d_b = \tfrac{3}{2} i$
  Duplicate Genes        Dom. Epistasis with $d_a = d_b$
  Rec. Suppressor        Dom. Epistasis with $d_a = -3d_b$
  Complementary Genes    Rec. Epistasis with $d_a = d_b$

[The 3 x 3 squares of class phenotypes (columns AA, Aa, aa; rows BB, Bb, bb) are not recoverable from the scan.]

Relations among those parameters which yield the Classical $F_2$ (top left) and the
Dominant Dominance Modifier (top right) interaction are shown in full. The
remaining classical interactions are derived from the Classical $F_2$ by the addition of
further relations between the parameters as shown above each square. The class
phenotypes are shown within the squares.




below, the three categories $i$, $j$ and $l$ have their own properties of effect
and change with the generations, so that the classification into these
categories is, in principle, sufficient to enable us to understand, estimate
and predict the effects of interactions between pairs of genes.

This system of classification can be extended to interactions between 
three or more genes. With trigenic interactions we should recognise 
four categories, hom-hom-hom, hom-hom-het, hom-het-het and
het-het-het. So four new types of parameter would come into the analysis,
though two of the types would each include three individual parameters,
making eight in all. To describe the phenotypes of the 27 genotypes
produced by three genes requires 26 parameters. Of these 18 are already 
available, 6 from the two parameters describing the main effects of 
each of three genes, and 12 from the four parameters describing the 
digenic interactions among the three pairs possible with three genes. 
The 8 parameters required for the trigenic interactions complete the 
tally. 
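The tally can be checked by counting. An interaction among k genes has one parameter for each way of choosing the d or the h comparison of every gene involved, so n genes carry C(n, k) x 2^k parameters of order k. The following is a minimal sketch of that count; the function names are illustrative and do not come from the paper.

```python
from math import comb


def parameters_of_order(n_genes: int, order: int) -> int:
    """Number of parameters describing interactions among `order` genes.

    order == 1 counts the main effects (d and h for each gene).
    Every gene taking part contributes either its d or its h comparison,
    giving 2**order parameters for each set of `order` genes.
    """
    return comb(n_genes, order) * 2 ** order


def total_parameters(n_genes: int) -> int:
    """All parameters needed for the 3**n_genes genotypes (one fewer than their number)."""
    return sum(parameters_of_order(n_genes, k) for k in range(1, n_genes + 1))


# Three genes: 6 main effects, 12 digenic and 8 trigenic parameters, 26 in all.
print([parameters_of_order(3, k) for k in (1, 2, 3)], total_parameters(3))
```

For two genes the same count gives 4 main-effect and 4 interaction parameters, matching the d, h, i, j and l of the digenic model.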

The phenotype is found as the algebraic sum of all the parameters
associated with the genotype in question (Table 1). So the sum of the
“main effect” parameters, $d$ and $h$, gives a first approximation to the
phenotype, one which neglects all interactions. Thus for the genotype
AABB in Table 1, we should have $d_a + d_b$ as this first description of
the phenotype. Moving to the next level of approximation by
admitting digenic interactions (which in this simple two gene model is
the final approximation giving a complete description) we define the
phenotype as $d_a + d_b + i_{ab} - \tfrac{1}{2}(j_{a|b} + j_{b|a}) + \tfrac{1}{4} l_{ab}$. With a polygenic
model we can obtain a next approximation by bringing in the eight
parameters for trigenic interactions and so on. These successive
approximations might, however, be expected to become of less and less
advantage. Most of the variation will generally (though not, of course,
necessarily) be accounted for by the “main effect” parameters, most
of the rest by the parameters for digenic interactions and so on. There
will thus be little justification for considering the more complex
interactions until the digenic type has been fully explored.
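The two levels of approximation can be sketched in a few lines. Table 1 is not reproduced in this part of the paper, so the coefficients below are an assumption inferred from the text: each gene carries a d-type coefficient of +1, 0, -1 and an h-type interaction coefficient of -1/2, +1/2, -1/2 for the increasing homozygote, the heterozygote and the decreasing homozygote, which yields the 1/2 on the j's and the 1/4 on l. All names are illustrative.

```python
from fractions import Fraction as Fr

# Genotype at one locus = number of increasing alleles (2 = AA, 1 = Aa, 0 = aa).
D_COEF = {2: Fr(1), 1: Fr(0), 0: Fr(-1)}             # coefficient of d
H_COEF = {2: Fr(0), 1: Fr(1), 0: Fr(0)}              # coefficient of h
Z_COEF = {2: Fr(-1, 2), 1: Fr(1, 2), 0: Fr(-1, 2)}   # h-type factor in j and l (assumed)


def phenotype(ga, gb, p, with_interactions=True):
    """Phenotype of a two-gene genotype as a sum of parameters.

    p holds da, db, ha, hb, i, j_ab (d_a with h_b), j_ba (h_a with d_b) and l.
    With with_interactions=False only the main effects are summed.
    """
    value = (p["da"] * D_COEF[ga] + p["ha"] * H_COEF[ga]
             + p["db"] * D_COEF[gb] + p["hb"] * H_COEF[gb])
    if with_interactions:
        value += (p["i"] * D_COEF[ga] * D_COEF[gb]
                  + p["j_ab"] * D_COEF[ga] * Z_COEF[gb]
                  + p["j_ba"] * Z_COEF[ga] * D_COEF[gb]
                  + p["l"] * Z_COEF[ga] * Z_COEF[gb])
    return value


# Hypothetical parameter values, chosen only to make the arithmetic visible.
params = dict(da=3, db=2, ha=1, hb=1, i=4, j_ab=2, j_ba=6, l=8)
first = phenotype(2, 2, params, with_interactions=False)   # d_a + d_b
full = phenotype(2, 2, params)   # d_a + d_b + i - (j_ab + j_ba)/2 + l/4
print(first, full)
```

With these values the first approximation for AABB is 5 and the complete digenic description is 7.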


The Effects of Interactions 


The contributions of the two interacting genes to the mean expression
of the character in the various generations derivable from a cross
between two true-breeding lines are shown in Table 3. All increments
are measured from the mid-parent value, which is of course the mean
of the expression in the two parental lines. Two crosses are possible,
in respect of the two genes: one where the increasing allelomorphs of
the two genes are associated in one parent, and the decreasing
allelomorphs in the other (AABB x aabb); and another where each parental
line carries the increasing allelomorph of one gene and the decreasing
allelomorph of the other (AAbb x aaBB). These are referred to
respectively as “Associated” and “Dispersed” distributions of the
genes. The mean expressions in the parental families ($P_1$ and $P_2$) and
in the families raised by backcrossing the $F_1$ to the parents ($B_1$ from
$F_1 \times P_1$, and $B_2$ from $F_1 \times P_2$) vary with genic distribution in the
parents; but the means of $F_1$, $F_2$, $F_3$ and the biparental third generation
or $S_3$ (raised by random crossing among the individuals of $F_2$) are
independent of distribution in the absence of linkage. Free recombination
of the genes is assumed in all these formulae, and indeed in the
whole of the present discussion.


TABLE 3
Generation Means in Respect of Two Interacting Genes

Parents ($P_1$, $P_2$):
  Associated    $\pm d_a \pm d_b + i_{ab} \mp \tfrac{1}{2}(j_{a|b} + j_{b|a}) + \tfrac{1}{4} l_{ab}$
  Dispersed     $\pm d_a \mp d_b - i_{ab} \mp \tfrac{1}{2}(j_{a|b} - j_{b|a}) + \tfrac{1}{4} l_{ab}$
Backcrosses ($B_1$, $B_2$):
  Associated    $\tfrac{1}{2}(\pm d_a \pm d_b + h_a + h_b + \tfrac{1}{2} i_{ab})$
  Dispersed     $\tfrac{1}{2}(\pm d_a \mp d_b + h_a + h_b - \tfrac{1}{2} i_{ab})$
$F_1$           $h_a + h_b + \tfrac{1}{4} l_{ab}$
$F_2$           $\tfrac{1}{2}(h_a + h_b)$
$F_3$           $\tfrac{1}{4}(h_a + h_b) + \tfrac{1}{16} l_{ab}$
$S_3$           $\tfrac{1}{2}(h_a + h_b)$

SCALING TESTS
                                   Associated                                              Dispersed
A   $P_1 + F_1 - 2B_1$             $\tfrac{1}{2}(i_{ab} - j_{a|b} - j_{b|a} + l_{ab})$     $\tfrac{1}{2}(-i_{ab} - j_{a|b} + j_{b|a} + l_{ab})$
B   $P_2 + F_1 - 2B_2$             $\tfrac{1}{2}(i_{ab} + j_{a|b} + j_{b|a} + l_{ab})$     $\tfrac{1}{2}(-i_{ab} + j_{a|b} - j_{b|a} + l_{ab})$
C   $P_1 + P_2 + 2F_1 - 4F_2$      $2i_{ab} + l_{ab}$                                      $-2i_{ab} + l_{ab}$
D   $P_1 + P_2 + 2F_2 - 4F_3$      $2i_{ab} + \tfrac{1}{4} l_{ab}$                         $-2i_{ab} + \tfrac{1}{4} l_{ab}$

“Associated” refers to the cross in which increasing and decreasing allelomorphs of the
two genes occur in the same parents (AABB x aabb); and “Dispersed” to the alternative
cross (AAbb x aaBB). Where two signs are shown before a term, the upper and
lower signs used in the formulae refer respectively to the upper and lower families
shown on the left.




The values to be expected from the scaling tests (Mather 1949a)
are shown at the bottom of Table 3. These all reduce to zero when no
interaction is present, but each type of test depends on characteristic
sets of interactions for its departure from zero. In other words, each
type of scaling test is capable of detecting its own characteristic
constellation of interactions. Where the mean of $F_3$ is available, D provides
a test largely of the $i$ type interaction. Test C depends to a greater
extent on the $l$ type interaction, and so provides a means of assessing
both $i$ and $l$ interactions when used in conjunction with D. The $j$ type
interactions have no effect on tests C and D, but will affect the outcome
of the backcross tests, A and B. Combinations of these tests can
obviously be devised to detect specific types of interaction.
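The dependence of each scaling test on its own set of interactions can be checked by direct enumeration of the families. The sketch below rests on assumptions not stated in this passage: the genotype coefficients are those inferred from the text (1/2 on the j's, 1/4 on l), the tests are taken as A = P1 + F1 - 2B1, B = P2 + F1 - 2B2, C = P1 + P2 + 2F1 - 4F2 and D = P1 + P2 + 2F2 - 4F3, recombination is free, and all names and parameter values are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

# Genotype at one locus = number of increasing alleles (2, 1 or 0).
D_COEF = {2: Fr(1), 1: Fr(0), 0: Fr(-1)}
H_COEF = {2: Fr(0), 1: Fr(1), 0: Fr(0)}
Z_COEF = {2: Fr(-1, 2), 1: Fr(1, 2), 0: Fr(-1, 2)}   # assumed h-type factor


def phenotype(g, p):
    ga, gb = g
    return (p["da"] * D_COEF[ga] + p["ha"] * H_COEF[ga]
            + p["db"] * D_COEF[gb] + p["hb"] * H_COEF[gb]
            + p["i"] * D_COEF[ga] * D_COEF[gb]
            + p["j_ab"] * D_COEF[ga] * Z_COEF[gb]
            + p["j_ba"] * Z_COEF[ga] * D_COEF[gb]
            + p["l"] * Z_COEF[ga] * Z_COEF[gb])


def locus_offspring(g1, g2):
    """Offspring genotype distribution at one locus from parents g1 x g2."""
    dist = {}
    for a1, a2 in product((0, 1), repeat=2):
        pr = (Fr(g1, 2) if a1 else Fr(2 - g1, 2)) * (Fr(g2, 2) if a2 else Fr(2 - g2, 2))
        if pr:
            dist[a1 + a2] = dist.get(a1 + a2, Fr(0)) + pr
    return dist


def mate(g1, g2):
    """Two-locus offspring distribution; free recombination keeps the loci independent."""
    la, lb = locus_offspring(g1[0], g2[0]), locus_offspring(g1[1], g2[1])
    return {(ga, gb): pa * pb for ga, pa in la.items() for gb, pb in lb.items()}


def selfed(pop):
    """Next generation when every individual of `pop` is self-fertilised."""
    out = {}
    for g, w in pop.items():
        for child, pr in mate(g, g).items():
            out[child] = out.get(child, Fr(0)) + w * pr
    return out


def mean(pop, p):
    return sum(w * phenotype(g, p) for g, w in pop.items())


def scaling_tests(p1, p2, p):
    """Expected values of A, B, C and D for true-breeding parents p1 and p2."""
    f1 = mate(p1, p2)
    (f1_genotype,) = f1                      # the F1 is a single genotype
    f2 = selfed(f1)
    f3 = selfed(f2)
    P1, P2, F1 = phenotype(p1, p), phenotype(p2, p), mean(f1, p)
    B1 = mean(mate(f1_genotype, p1), p)
    B2 = mean(mate(f1_genotype, p2), p)
    F2, F3 = mean(f2, p), mean(f3, p)
    return {"A": P1 + F1 - 2 * B1, "B": P2 + F1 - 2 * B2,
            "C": P1 + P2 + 2 * F1 - 4 * F2, "D": P1 + P2 + 2 * F2 - 4 * F3}


params = dict(da=3, db=2, ha=1, hb=1, i=4, j_ab=2, j_ba=6, l=8)
associated = scaling_tests((2, 2), (0, 0), params)   # AABB x aabb
dispersed = scaling_tests((2, 0), (0, 2), params)    # AAbb x aaBB
```

Setting i, j_ab, j_ba and l to zero makes all four tests vanish, and changing the j's alone leaves C and D untouched, in line with the paragraph above.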

It should be observed that with a particular distribution of genes
between the parents, A and B may afford only insensitive tests of $j$
interactions, for these may in part cancel out. The sign of the effect
of the $i$ interaction in all tests also varies with the distribution of the
genes in the parent lines, but the contribution made by the $l$ interaction
is unaffected by genic distribution. Furthermore, where more than two
interacting genes are affecting the variation of the character in a cross,
so that there may be two or more interactions of each kind, the different
$i$ and $l$ interactions, as well as the $j$ interactions, may tend to balance
one another’s effects if the directions of the individual $i$’s and $l$’s vary.
Or to put it in other words, if, for example, of $i_{ab}$, $i_{ac}$, $i_{bc}$ etc. some
are acting in the + direction and others in the - direction, the sum
of these $i$’s (which will appear in the scaling tests) may well be low
because of the balancing relations of the different $i$’s one against another.

This balancing action, introduced by differences in sign, is always
likely to be encountered in the contributions made to means and
comparisons between them. It is less troublesome when we turn to the
effect of interactions on the second degree statistics which we calculate
from segregating generations. The contribution of two interacting
genes to the variances and covariances obtained from backcrosses,
$F_2$, $F_3$ and $S_3$ are given in Table 4. The variance of $F_2$ ($V_{1F2}$) includes
separate items for each type of interaction, and since these items are
all quadratic, the contribution will be unaffected by sign. The same is
true of the three statistics ($V_{1S3}$, $V_{2S3}$ and $W_{1S23}$) obtainable from
the $S_3$ generation.
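That the F2 variance is indifferent to the sign of each interaction can be seen by enumerating the nine F2 genotypes. In the sketch below the genotype coefficients are again an assumption inferred from the text, and the quadratic expression is derived from those coefficients rather than copied from Table 4; names and values are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

D_COEF = {2: Fr(1), 1: Fr(0), 0: Fr(-1)}
H_COEF = {2: Fr(0), 1: Fr(1), 0: Fr(0)}
Z_COEF = {2: Fr(-1, 2), 1: Fr(1, 2), 0: Fr(-1, 2)}   # assumed h-type factor
F2_LOCUS = {2: Fr(1, 4), 1: Fr(1, 2), 0: Fr(1, 4)}   # 1 : 2 : 1 at each locus


def phenotype(ga, gb, p):
    return (p["da"] * D_COEF[ga] + p["ha"] * H_COEF[ga]
            + p["db"] * D_COEF[gb] + p["hb"] * H_COEF[gb]
            + p["i"] * D_COEF[ga] * D_COEF[gb]
            + p["j_ab"] * D_COEF[ga] * Z_COEF[gb]
            + p["j_ba"] * Z_COEF[ga] * D_COEF[gb]
            + p["l"] * Z_COEF[ga] * Z_COEF[gb])


def f2_variance(p):
    """Variance of the F2 by enumeration, loci segregating independently."""
    pop = [(F2_LOCUS[ga] * F2_LOCUS[gb], phenotype(ga, gb, p))
           for ga, gb in product(F2_LOCUS, repeat=2)]
    m = sum(w * x for w, x in pop)
    return sum(w * (x - m) ** 2 for w, x in pop)


def f2_quadratic_form(p):
    """Sum of separate squared items, one per parameter (derived under the stated assumptions)."""
    return (Fr(1, 2) * (p["da"] ** 2 + p["db"] ** 2)
            + Fr(1, 4) * (p["ha"] ** 2 + p["hb"] ** 2)
            + Fr(1, 4) * p["i"] ** 2
            + Fr(1, 8) * (p["j_ab"] ** 2 + p["j_ba"] ** 2)
            + Fr(1, 16) * p["l"] ** 2)


params = dict(da=3, db=2, ha=1, hb=1, i=4, j_ab=2, j_ba=6, l=8)
flipped = dict(params, i=-4, j_ab=-2, j_ba=-6, l=-8)
print(f2_variance(params), f2_variance(flipped))
```

Both calls return the same value, because every interaction enters the F2 variance only through its square.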

The situation in the case of the $F_3$ statistics is more ambiguous. A
portion of the effect of each type of interaction still appears as a separate
quadratic item, and indeed the whole of the effect of the $i$ interaction
is expressed in this way. The $j$ and $l$ interactions, on the other hand,
become partly confounded with the $d$ and $h$ items respectively, in the




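TABLE 4
Variances and Covariances in Respect of Two Interacting Genes

Summed variances of backcrosses, $V_{B1} + V_{B2}$:
  Associated $= \tfrac{1}{2}\{(d_a - \tfrac{1}{2} j_{b|a})^2 + (d_b - \tfrac{1}{2} j_{a|b})^2 + (h_a - \tfrac{1}{2} i_{ab})^2 + (h_b - \tfrac{1}{2} i_{ab})^2 + \tfrac{1}{4}(i_{ab} + l_{ab})^2 + \tfrac{1}{4}(j_{a|b} + j_{b|a})^2\}$
  Dispersed $= \tfrac{1}{2}\{(d_a + \tfrac{1}{2} j_{b|a})^2 + (d_b + \tfrac{1}{2} j_{a|b})^2 + (h_a + \tfrac{1}{2} i_{ab})^2 + (h_b + \tfrac{1}{2} i_{ab})^2 + \tfrac{1}{4}(i_{ab} - l_{ab})^2 + \tfrac{1}{4}(j_{a|b} - j_{b|a})^2\}$
Variance of $F_2$,
  $V_{1F2} = \tfrac{1}{2} d_a^2 + \tfrac{1}{2} d_b^2 + \tfrac{1}{4} h_a^2 + \tfrac{1}{4} h_b^2 + \tfrac{1}{4} i_{ab}^2 + \tfrac{1}{8} j_{a|b}^2 + \tfrac{1}{8} j_{b|a}^2 + \tfrac{1}{16} l_{ab}^2$
Variance of $F_3$ means,
  $V_{1F3} = \tfrac{1}{2}(d_a - \tfrac{1}{4} j_{a|b})^2 + \tfrac{1}{2}(d_b - \tfrac{1}{4} j_{b|a})^2 + \tfrac{1}{16}(h_a - \tfrac{1}{4} l_{ab})^2 + \tfrac{1}{16}(h_b - \tfrac{1}{4} l_{ab})^2 + \tfrac{1}{4} i_{ab}^2 + \tfrac{1}{32} j_{a|b}^2 + \tfrac{1}{32} j_{b|a}^2 + \tfrac{1}{256} l_{ab}^2$
Mean variance of $F_3$ families,
  $V_{2F3} = \tfrac{1}{4}(d_a - \tfrac{1}{4} j_{a|b})^2 + \tfrac{1}{4}(d_b - \tfrac{1}{4} j_{b|a})^2 + \tfrac{1}{8}(h_a - \tfrac{1}{4} l_{ab})^2 + \tfrac{1}{8}(h_b - \tfrac{1}{4} l_{ab})^2 + \tfrac{5}{16} i_{ab}^2 + \tfrac{7}{64} j_{a|b}^2 + \tfrac{7}{64} j_{b|a}^2 + \tfrac{1}{32} l_{ab}^2$
Covariance of $F_2$ and $F_3$ family means,
  $W_{1F23} = \tfrac{1}{2} d_a(d_a - \tfrac{1}{4} j_{a|b}) + \tfrac{1}{2} d_b(d_b - \tfrac{1}{4} j_{b|a}) + \tfrac{1}{8} h_a(h_a - \tfrac{1}{4} l_{ab}) + \tfrac{1}{8} h_b(h_b - \tfrac{1}{4} l_{ab}) + \tfrac{1}{4} i_{ab}^2 + \tfrac{1}{16} j_{a|b}^2 + \tfrac{1}{16} j_{b|a}^2 + \tfrac{1}{64} l_{ab}^2$
Variance of BIP means,
  $V_{1S3} = \tfrac{1}{4} d_a^2 + \tfrac{1}{4} d_b^2 + \tfrac{1}{16} h_a^2 + \tfrac{1}{16} h_b^2 + \tfrac{1}{16} i_{ab}^2 + \tfrac{1}{64} j_{a|b}^2 + \tfrac{1}{64} j_{b|a}^2 + \tfrac{1}{256} l_{ab}^2$
Mean variance of BIP families,
  $V_{2S3} = \tfrac{1}{4} d_a^2 + \tfrac{1}{4} d_b^2 + \tfrac{3}{16} h_a^2 + \tfrac{3}{16} h_b^2 + \tfrac{3}{16} i_{ab}^2 + \tfrac{7}{64} j_{a|b}^2 + \tfrac{7}{64} j_{b|a}^2 + \tfrac{15}{256} l_{ab}^2$
Covariance of $F_2$ and BIP means,
  $W_{1S23} = \tfrac{1}{4} d_a^2 + \tfrac{1}{4} d_b^2 + \tfrac{1}{16} i_{ab}^2$

BIP stands for biparental families of the third generation. This generation is
referred to as $S_3$.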


form of terms of the type $(d - \tfrac{1}{4} j)^2$ and $(h - \tfrac{1}{4} l)^2$. The size of these
terms will obviously depend on the sign, and hence direction, of the
interactions. The partial pooling of interaction with main effect could
serve either to enhance or to diminish the contribution made to the
statistics according to the direction of interaction. Where several
interacting genes are involved in the system some terms might, of
course, tend to enhance, and others to diminish, the variance
simultaneously.
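The effect of this pooling can be illustrated by computing the variance of F3 family means for one gene pair with the j interaction taken in either direction. This is a sketch under the same assumed genotype coefficients (1/2 on the j's, 1/4 on l) and free recombination; the parameter values are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

D_COEF = {2: Fr(1), 1: Fr(0), 0: Fr(-1)}
H_COEF = {2: Fr(0), 1: Fr(1), 0: Fr(0)}
Z_COEF = {2: Fr(-1, 2), 1: Fr(1, 2), 0: Fr(-1, 2)}   # assumed h-type factor
F2_LOCUS = {2: Fr(1, 4), 1: Fr(1, 2), 0: Fr(1, 4)}
# Offspring distribution at one locus when a parent of the given genotype is selfed.
SELFED = {2: {2: Fr(1)},
          1: {2: Fr(1, 4), 1: Fr(1, 2), 0: Fr(1, 4)},
          0: {0: Fr(1)}}


def phenotype(ga, gb, p):
    return (p["da"] * D_COEF[ga] + p["ha"] * H_COEF[ga]
            + p["db"] * D_COEF[gb] + p["hb"] * H_COEF[gb]
            + p["i"] * D_COEF[ga] * D_COEF[gb]
            + p["j_ab"] * D_COEF[ga] * Z_COEF[gb]
            + p["j_ba"] * Z_COEF[ga] * D_COEF[gb]
            + p["l"] * Z_COEF[ga] * Z_COEF[gb])


def f3_family_mean(ga, gb, p):
    """Mean of the F3 family raised by selfing an F2 plant of genotype (ga, gb)."""
    return sum(pa * pb * phenotype(ca, cb, p)
               for ca, pa in SELFED[ga].items()
               for cb, pb in SELFED[gb].items())


def variance_of_f3_means(p):
    """Rank-one F3 variance: variance of family means over the F2 parents."""
    fams = [(F2_LOCUS[ga] * F2_LOCUS[gb], f3_family_mean(ga, gb, p))
            for ga, gb in product(F2_LOCUS, repeat=2)]
    m = sum(w * x for w, x in fams)
    return sum(w * (x - m) ** 2 for w, x in fams)


base = dict(da=3, db=0, ha=0, hb=0, i=0, j_ab=0, j_ba=0, l=0)
same_direction = variance_of_f3_means(dict(base, j_ab=4))    # 1/2 (3 - 1)^2 + 16/32
opposite = variance_of_f3_means(dict(base, j_ab=-4))         # 1/2 (3 + 1)^2 + 16/32
print(same_direction, opposite)
```

Reversing the sign of the interaction changes the variance from 5/2 to 17/2 here, whereas the corresponding F2 variance would be the same in both cases.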

The same is also true of the contributions to the summed variances
of the backcrosses, though the compound terms are different, involving
a different $j$ with a given $d$, and $i$ instead of $l$ with $h$. The remainder of
the interaction effects appear in compound terms involving $i$ with $l$
and the two $j$’s together. There is a further complication in these
backcross variances, for the size of each compound term varies with
the distribution of the genes between the parents. The backcross
variances are indeed subject to so many sources of complication that
they are likely to be relatively uninformative.

The $F_3$ and $S_3$ statistics should be informative in different ways.
Since the interactions remain unconfounded in the $S_3$ statistics, they
can be used to help directly in the separation of main and interactive
effects. The covariance, $W_{1S23}$, is likely to be of special value as it
includes only terms in $d^2$ and $i^2$ and so provides in a sense a direct
measure of the fixable heritable variance since $i$ is the fixable interaction.
Statistics from later S generations are not likely to have this same
advantage, as they will almost certainly contain terms in which parts
of the interactions are confounded with main genic effects.

The $F_3$ statistics already show this confounding of the interactions,
and they enable us to see something of its effects. The two variances,
$V_{1F3}$ and $V_{2F3}$, contain the same types of term as each other, though
with different coefficients. The terms in $W_{1F23}$ are the geometric means
of the corresponding terms in $V_{1F2}$ and $V_{1F3}$. Thus the term $\tfrac{1}{2} d_a^2$ in
$V_{1F2}$ is replaced by $\tfrac{1}{2}(d_a - \tfrac{1}{4} j_{a|b})^2$ in $V_{1F3}$ and the corresponding term
in $W_{1F23}$ is $\tfrac{1}{2} d_a(d_a - \tfrac{1}{4} j_{a|b}) = \sqrt{\tfrac{1}{2} d_a^2 \cdot \tfrac{1}{2}(d_a - \tfrac{1}{4} j_{a|b})^2}$. The corresponding
term in $V_{2F3}$ depends on $(d_a - \tfrac{1}{4} j_{a|b})^2$ just as in $V_{1F3}$, but of course,
with its own characteristic coefficient. To put all this in another way,
the definitions of $D$ and $H$ change from $\sum d_a^2$ and $\sum h_a^2$ respectively in
$F_2$ to $\sum (d_a - \tfrac{1}{4} \sum j_{a|b})^2$ and $\sum (h_a - \tfrac{1}{4} \sum l_{ab})^2$ in $F_3$, so that the basic
constitution of the terms, or components, of variation changes with
generations but is constant over ranks within the generation. The summation
sign is placed before $j$ and $l$ to indicate summation over all the
appropriate digenic interactions which this gene shows with its fellow members
of the polygenic system. That this is a general property can be seen from
the general formulae for the variance of rank m in generation n of the
selfing series and the covariance of rank m in generations n and n’
which are


= 2" (da — 27°") jars)” 
+ — 3) 
+ — 1) 


and Ware = 2 — Jase) 
(he — (4 — lias) 
+ — 3) 
+ — 1) 


Now the definitions of D and H also change when the genes are 
linked, but they change with rank and not with generation (Table 5). 


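TABLE 5
Changes in the Main Components of Variation with Interaction and Linkage

                                     Structure of Component
     Statistic   Coefficient   Simple    Interaction                                   Linkage
D    $V_{1F2}$   $\tfrac{1}{2}$    $d_a^2$   $d_a^2$                                   $d_a^2 \pm 2\sum d_a d_b (1 - 2p_{ab})$
     $W_{1F23}$  $\tfrac{1}{2}$    $d_a^2$   $d_a(d_a - \tfrac{1}{4}\sum j_{a|b})$     $d_a^2 \pm 2\sum d_a d_b (1 - 2p_{ab})$
     $V_{1F3}$   $\tfrac{1}{2}$    $d_a^2$   $(d_a - \tfrac{1}{4}\sum j_{a|b})^2$      $d_a^2 \pm 2\sum d_a d_b (1 - 2p_{ab})$
     $V_{2F3}$   $\tfrac{1}{4}$    $d_a^2$   $(d_a - \tfrac{1}{4}\sum j_{a|b})^2$      $d_a^2 \pm 2\sum d_a d_b (1 - 2p_{ab})^2$

H    $V_{1F2}$   $\tfrac{1}{4}$    $h_a^2$   $h_a^2$                                   $h_a^2 + 2\sum h_a h_b (1 - 2p_{ab})^2$
     $W_{1F23}$  $\tfrac{1}{8}$    $h_a^2$   $h_a(h_a - \tfrac{1}{4}\sum l_{ab})$      $h_a^2 + 2\sum h_a h_b (1 - 2p_{ab})^2$
     $V_{1F3}$   $\tfrac{1}{16}$   $h_a^2$   $(h_a - \tfrac{1}{4}\sum l_{ab})^2$       $h_a^2 + 2\sum h_a h_b (1 - 2p_{ab})^2$
     $V_{2F3}$   $\tfrac{1}{8}$    $h_a^2$   $(h_a - \tfrac{1}{4}\sum l_{ab})^2$       $h_a^2 + 2\sum h_a h_b (1 - 2p_{ab})^2 (1 - 2p_{ab} + 2p_{ab}^2)$

$p_{ab}$ is the frequency of recombination between genes A-a and B-b. The $\pm$ in the
linkage terms of D indicates addition for coupling and subtraction for repulsion.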


We have thus a means of separating the effects of interaction and linkage.
In the absence of interaction, $D$ and $H$ are homogeneous over $V_{1F2}$,
$V_{1F3}$ and $W_{1F23}$, but will change in $V_{2F3}$ where linkage is acting. With
interaction, $D$ and $H$ will be inhomogeneous over $V_{1F2}$, $V_{1F3}$ and
$W_{1F23}$, and they will vary no more between these three statistics as a
group on the one hand and $V_{2F3}$ on the other, than they do within the
group of three. Thus the tests of residual interaction and linkage used
by Mather (1949a) and by Mather and Vines (1952) are sound, with
one proviso. Only the $j$ and $l$ interactions cause inhomogeneity among
the rank one variances and covariance of $F_2$ and $F_3$. The definitions
of $D$ and $H$ are unaffected by $i$ interactions. These $i$ interactions will
not therefore be detected by the test of residual interaction, and may
serve to inflate $V_{2F3}$ as compared with the first rank statistics, since
the coefficient of $i^2$ is disproportionately large in this second rank
variance. Inflation of $V_{2F3}$ would mimic repulsion linkage. The $i$
interactions may, therefore, be confused with repulsion linkage, though
they would never mimic coupling linkage in their effects. This
constitutes the only danger of confusion when conclusions are based on
data from $F_2$ and $F_3$; and the inclusion of further types of family may
well afford a means of removing even this possible confusion. The
final resolution of this remaining problem must, however, await the
fuller consideration which is now being given to the effects of
interaction on statistics from families in series obtained by mating systems
other than selfing.

One point may perhaps be reiterated in conclusion. Our consideration
applies to all types of digenic interaction, for, as we have seen,
all such interactions, whether we would recognise them as of the
complementary, epistatic or any other kind, can be represented, combined and
manipulated in terms of $i$, $j$ and $l$. The contributions made to the
various means, variances and covariances by a pair of genes showing
any of the classical types of interaction can be simply obtained as
special cases from the general expressions of Tables 3 and 4 by imposing
the appropriate relations between $d$, $h$, $i$, $j$ and $l$ from Table 2. And,
finally, the present method of representation and analysis can be
extended to trigenic and higher interactions with, we believe, equal
prospects of successful understanding and interpretation.


Summary 


The knowledge that the genes mediating continuous variation are 
carried in the nucleus enables us to assume the genetical specification 
of the families in suitably designed experiments, except in respect of 
linkage relations which must generally be inferred from the variation 
observed. The specification of phenotypic effect of the genes is, how- 
ever, seldom if ever capable of the same precise assumptions. The 
effects of the genes, and their interactions, must generally be inferred 
from the phenotypic variation observed. 

The phenotypic effect of a gene can be described completely in terms 




of the parameters $d$ and $h$ used by Fisher et al. (1932) and by Mather
(1949a). Four more parameters are required for the complete
description of a digenic interaction. These may be conveniently defined as
the “interaction” comparisons (in the statistical sense) of the $d$ and $h$
“main effects” of the two genes. Thus $i_{ab}$ is the interaction of $d_a$ and
$d_b$, $j_{a|b}$ that of $d_a$ and $h_b$, $j_{b|a}$ that of $h_a$ and $d_b$ and $l_{ab}$ that of $h_a$ and $h_b$.

All digenic interactions, including the classical types such as
complementary action, epistatic action and so on, can be defined in terms of
relations between $d$, $h$, $i$, $j$ and $l$. Different types of interaction (in the
classical sense) can thus be expressed and combined in terms of $i$, $j$ and $l$.
This system of describing interactions is capable of extension to trigenic
and higher orders of interaction.

The effects of digenic interaction on means, variances, covariances
and scaling tests derivable from backcrosses, $F_2$, $F_3$ and third generation
biparental progenies ($S_3$) of a cross between two true breeding lines are
analysed, and shown to be usefully expressible in terms of $i$, $j$ and $l$.
The use of scaling tests and of the second degree statistics in detecting
digenic interactions is considered, and it is shown how the effect of
interaction may be separated from that of linkage in the second degree
statistics obtainable from $F_2$ and $F_3$. The only confusion to be
anticipated is of $i$ type interaction with repulsion linkage. Other types of
family should help to remove even this possible confusion.


REFERENCES 


Bateman, A. J. and Mather, K. (1951) The progress of inbreeding in barley. Heredity,
5: 321-48.

Fisher, R. A., Immer, F. R., and Tedin, O. (1932) The genetical interpretation of
statistics of the third degree in the study of quantitative inheritance. Genetics, 17:
107-24.

Mather, K. (1949a) Biometrical Genetics. Methuen, London.

Mather, K. (1949b) The genetical theory of continuous variation. Proc. 8th Int. Cong.
Genetics. Hereditas, Suppl. Vol.: 376-401.

Mather, K. and Vines, A. (1952) The inheritance of height and flowering time in a cross of
Nicotiana rustica. Quantitative Inheritance, 49-80, ed. E. C. R. Reeve and C. H.
Waddington. H.M.S.O., London.


QUANTITATIVE STUDIES IN DIPHTHERIA PROPHYLAXIS. 


AN ATTEMPT TO DERIVE A MATHEMATICAL CHARACTERIZATION
OF THE ANTIGENICITY OF DIPHTHERIA PROPHYLACTIC*


L. B. Holt


The Wright-Fleming Institute of Microbiology, 
St. Mary’s Hospital Medical School, 
Paddington, London, W.2. 


Examined quantitatively, the antibody responses of animals and
children to inoculations of different forms of diphtheria prophylactic
vary greatly. The dose-response curves from such materials, however,
do show a similar pattern, and there is a great variation between
different forms of prophylactic in the dosage required to induce some
arbitrary level of response (Jerne & Maaløe, 1949).

As Jerne & Wood (1949) point out, an assay of a test preparation
(T.P.) in terms of a standard preparation (S.P.) is strictly valid only
if “the less potent preparation behaves as though it were a dilution
of the other in a completely inert diluent. The relative potency of
the T.P. in terms of S.P., defined as the ratio of doses required to
produce a given response, is then independent of the dose level of
response at which it is measured.” They continue “... this is the only
definition of relative potency that would normally be regarded by the
bio-assayist as satisfactory .... An instance of current interest in
which this assumption does not hold is the assay of diphtheria and
tetanus toxoids in commercial products containing aluminium hydroxide,
using as S.P. a reference sample of highly purified toxoid ... the
dose-response curves of the two preparations have different upper asymptotes
and cannot be described by the same form .... The assay is thus
invalid.”

Here then we have the problem; two preparations have a property 
in common, viz. the ability to cause the development of antitoxin when 
injected, but the one cannot be expressed quantitatively in terms of 
the other in the usual way. 

The present communication is concerned with (a) an attempt to 
overcome this difficulty by finding antigenicity equations applicable 
to all types of diphtheria prophylactic (and probably other antigens) 


*Based on a communication read before the Third International Biometric Conference, Bellagio, 
September 1953. 




by which they can be completely described in mathematical terms, and 
(b) indicating some difficulties involved in the translation of results 
obtained in the laboratory to the field. 

When we give groups of children or animals an inoculation of an
antigen, we may measure the specific response in two quite different
ways, (a) by the percentages that attain or exceed some arbitrary level
of response, or (b) by determining the geometric mean titre of responses.
Whichever method is used, it is known that the titres among a group of
similar subjects identically treated are lognormally distributed: vide
Barr (1950) in respect of horses, Barr, Glenny & Randall (1950) for
children, and Holt (1951) for guinea pigs. In practice
we are more often interested in knowing the percentage that fails to
attain some measure of response than in knowing the percentage at each
level of response. From this it follows that the table of the cumulative
normal distribution is of more value to us than that of the ordinate
(Fisher & Yates, 1948).

Since the distribution of titres in a group is lognormal, it follows 
(a) that comparisons should be made in terms of geometric means or 
log geometric means and (b) that two groups may have the same 
geometric mean but the standard deviations of logs of titres may 
differ considerably; therefore a strict comparison cannot be made simply 
from the geometric means. 

The relationship between the geometric mean titre and the percentage 
that attains or exceeds some arbitrary titre may be expressed as 


$$U = \log u + \sigma\,(\operatorname{probit} y - 5) \qquad \text{(a)}$$

where

$U$ = log geometric mean titre

$u$ = some arbitrary titre

$\sigma$ = standard deviation of logs of titres

$y$ = the percentage of subjects possessing $u$ units of antitoxin, or more, per ml. of serum.
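Equation (a) can be used in either direction. The following is a minimal sketch, assuming logarithms to base 10 and the usual probit convention (normal deviate plus 5); the function names and the example titre are hypothetical, and 0.42 is used only because it is a standard deviation of the size reported later for children.

```python
from math import log10
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def probit(percent):
    """Probit of a percentage: the standard normal deviate plus 5."""
    return 5 + _STD_NORMAL.inv_cdf(percent / 100)


def log_gm_titre(percent, u, sigma):
    """Equation (a): log geometric mean titre U from the percentage at or above u."""
    return log10(u) + sigma * (probit(percent) - 5)


def percent_at_or_above(log_gm, u, sigma):
    """Equation (a) solved for y: percentage of subjects with titre u or more."""
    return 100 * _STD_NORMAL.cdf((log_gm - log10(u)) / sigma)


# Hypothetical group: 80% of subjects reach 0.01 u./ml., sigma = 0.42.
U = log_gm_titre(80, u=0.01, sigma=0.42)
print(round(10 ** U, 4), round(percent_at_or_above(U, u=0.01, sigma=0.42), 1))
```

When exactly half the subjects reach the titre u, the probit is 5 and U reduces to log u, which is the property used below in passing from (b) to (c).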


If we now examine the results obtained from a series of graded
inocula (similar subjects and the same material and testing technique,
etc.), a dose response curve may be drawn by plotting the percentage
possessing some arbitrary level of antitoxin response, or more, against
the dose administered, and this is characteristically sigmoid in shape.
When, however, the probit of that percentage is plotted against the
logarithm of the dose administered, a straight line, i.e. the probit
regression line, may be obtained (Hazen, 1914; Whipple, 1916; Finney,
1952). The experimental evidence for this statement (e.g. Carlinfanti,
1948; Holt & Bousfield, 1949) relates to the probit of the Schick
conversion rate (S.C.R.) which does not correspond precisely to the
percentage, $y$, of subjects attaining an arbitrary titre (see later section).
Under certain plausible assumptions, however, a linear relationship
between probit S.C.R. and the log dose implies a linear relationship
between the probit of $y$ and the log dose (see the discussion on Schick
Conversion below).
The equation for the straight line may be written

$$\operatorname{probit} y = b \log Z + C$$

where
$b$ = slope of the regression line of probit $y$ on log dose,
$Z$ = the dose,
and $C$ = a constant.


If, therefore, a dose $z$ is required to give a 50% response, then the
probit for the percentage attaining, or exceeding, the titre $u$ from a
dose $Z$ will be

$$b(\log Z - \log z) + 5$$

which may be rewritten as

$$\operatorname{probit} y = b \log (Z/z) + 5 \qquad \text{(b)}$$

In brief, equation (a) describes the distribution of responses at one dose,
whereas equation (b) describes the whole dose-response curve.

Now equation (b) may be combined with equation (a) with the
advantage of having all the variables present in one expression. Let

$U(Z)$ = log geometric mean titre from a dose $Z$
and $U(z)$ = log geometric mean titre from a dose $z$.

Then by substituting the right-hand component of (b) for the term
“probit y” in (a), and putting $\log u = U(z)$ since $z$ gives a 50% response,
we obtain

$$U(Z) = U(z) + B \log (Z/z), \qquad \text{(c)}$$

where $B = \sigma b$.

The slope, $B$, of the regression line of log geometric mean of titres
on log dose is, therefore, equal to $\sigma b$.



It is essential for the validity of this transposition that o remains 
constant for all doses used for a given type or sample of stimulus. 
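Written out step by step, the substitution that leads to (c) runs as follows. This is only a restatement of the argument above in one chain, with the symbols as already defined; it assumes, as the text requires, that $\sigma$ is the same at doses $Z$ and $z$.

```latex
\begin{align*}
U(Z) &= \log u + \sigma\,(\operatorname{probit} y - 5)
      && \text{equation (a) at dose } Z \\
     &= \log u + \sigma\, b \log (Z/z)
      && \text{by (b): } \operatorname{probit} y - 5 = b \log (Z/z) \\
     &= U(z) + \sigma b \log (Z/z)
      && \text{at dose } z,\ \operatorname{probit} y = 5,\ \text{so (a) gives } U(z) = \log u \\
     &= U(z) + B \log (Z/z), \qquad B = \sigma b .
\end{align*}
```

If $\sigma$ varied with dose, the second and third lines would involve different values of $\sigma$ and the slope could no longer be written as a single product.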

The whole mathematical model may be described by the following
comprehensive equation, relating the probit of $y$ (the percentage of
antitoxin titres exceeding an arbitrary value $u$) to $u$ and $Z$ (the
dose):

$$\operatorname{probit} y = 5 + b \log (Z/d) - (1/\sigma) \log (u/u_0), \qquad \text{(d)}$$

where $b$ and $\sigma$ are defined as above and $d$ is the dose required to give a
geometric mean titre equal to some value $u_0$.

It will be seen that equation (d) provides for the complete
characterization of an antigen in physiological terms. It is a sine qua non that
the animal, age or weight of animal, and number of doses, route of
inoculation and time interval(s) employed, must always be specified
when values are given to the variables. Three constants enter into
equation (d). These may be taken to be $\sigma$, $b$ and the dose, $d$, required to
produce some arbitrary geometric mean titre, which might be usefully
taken as 0.003 u./ml. for diphtheria antitoxin in children, as this
corresponds approximately to a 50% Schick conversion rate (see below).
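A short sketch of equation (d) as a function of dose and titre may make the role of the three constants clearer. Logarithms are assumed to be to base 10; the function names are illustrative and the values of b, sigma and d in the example are hypothetical, chosen only to show the calculation.

```python
from math import log10
from statistics import NormalDist


def probit_y(Z, u, *, b, sigma, d, u0):
    """Equation (d): probit of the percentage with titre u or more after dose Z.

    b     slope of probit y on log dose
    sigma standard deviation of log titres
    d     dose giving a geometric mean titre of u0
    """
    return 5 + b * log10(Z / d) - (1 / sigma) * log10(u / u0)


def percent_responding(Z, u, **constants):
    """Percentage of subjects expected to reach titre u after dose Z."""
    return 100 * NormalDist().cdf(probit_y(Z, u, **constants) - 5)


# Hypothetical antigen: b = 1.5, sigma = 0.4, and 10 units give a G.M. of 0.003 u./ml.
constants = dict(b=1.5, sigma=0.4, d=10, u0=0.003)
at_reference = percent_responding(10, 0.003, **constants)    # dose d, titre u0
tenfold_dose = percent_responding(100, 0.003, **constants)   # probit rises by b
print(round(at_reference, 1), round(tenfold_dose, 1))
```

At the dose d and the titre u0 the response is 50% by construction; each tenfold increase in dose adds b to the probit, and each tenfold increase in the titre demanded subtracts 1/sigma.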

The constants $\sigma$ and $b$ appear in equation (c) only through their
product, $B$. I have suspected that the value of $B$ is unity; Prigge (1953)
contends that this must always be so, and indeed by applying the
conversion formula (infra) for geometric mean from S.C.R. to the field
data given by Holt & Bousfield (1949), $B$ is estimated to be 0.9. Even
if $B$ were unity or some other constant it would still be necessary to
determine $\sigma$ and $b$ separately for a complete characterization of
antigenicity.


DISCUSSION 


For a full characterization of the antigenic properties of samples
of diphtheria prophylactic, we need to know the values of $\sigma$ and $b$ as
determined in children, and the dose required to produce some arbitrary
measure of response. This latter, however, may be very different for
the same preparation and subject without evident alteration in the
values of $\sigma$ and $b$; for instance, Holt & Bousfield (1949) comparing the
Schick conversion rates, in children, from P.T.A.P. (Holt, 1947)
administered (a) subcutaneously and (b) intramuscularly, found that the
regression lines of probit S.C.R. on log dose were parallel but differed
considerably in position.

The values of $\sigma$, $b$ and $B$ have been determined for responses to a
single injection of P.T.A.P. in guinea-pigs (Table I). The values of
$\sigma$ have been calculated for both primary and secondary responses for



TABLE I.
Antigenicity Constants for P.T.A.P. in Guinea Pigs: Single Dose.

$B$ (Slope of log geometric mean titre on log dose)    0.6      (Holt, 1950)
$\sigma$ (Standard deviation of log titres)            0.512    (see Table II)


many batches of P.T.A.P., and their mean values and standard deviation
are shown in Table II. It is of interest to note that $\sigma$ does not alter
greatly from one to two doses; this is in marked contrast to the effect
of A.P.T. in guinea-pigs (Barr & Llewellyn-Jones, 1951).


TABLE II.
Data on the Standard Deviation of Logs of Titres, $\sigma$, in Guinea Pigs, for P.T.A.P.

                                            Primary          Max. Secondary
                                            Responses        Responses
No. of Groups (12 per group)                31               30
Mean Value of $\sigma$                      0.512            0.436
Range of Values of $\sigma$                 0.232-0.874      0.189-0.790
Standard Deviation of Values of $\sigma$    0.168            0.173

In respect of information from children the data available are not
entirely satisfactory. In the following section an empirical formula is
derived relating the log of the geometric mean titre to the S.C.R. viz.

$$\log \text{G.M.} = \bar{3}.5 + 0.7\,(\operatorname{probit} \text{S.C.R.} - 5) \qquad \text{(e)}$$

(here $\bar{3}.5 = -2.5$, so that a 50% S.C.R. corresponds to a geometric mean
titre of about 0.003 u./ml.). If $b'$ is the slope of the probit S.C.R./log dose
regression line, we should estimate $B$ as $0.7\,b'$. From the data of Holt &
Bousfield (1949) on P.T.A.P. the estimate of $B$ is 0.9. Barr, Glenny and
Randall (1950) give data for two doses of A.P.T. in which $\sigma$ is
approximately 0.4; this value is in close agreement with that calculated from
other sources (see next section) where a mean value of 0.42 is found.
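The conversion in equation (e) is easily put into a small function. The sketch below reads the constant term as -2.5, an assumption based on the statement elsewhere in the paper that a geometric mean titre of 0.003 u./ml. corresponds approximately to a 50% S.C.R.; logarithms are taken to base 10 and the function names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def probit(percent):
    """Probit of a percentage: the standard normal deviate plus 5."""
    return 5 + _STD_NORMAL.inv_cdf(percent / 100)


def gm_titre_from_scr(scr_percent, log_intercept=-2.5, slope=0.7):
    """Equation (e): geometric mean titre (u./ml.) implied by a Schick conversion rate.

    log_intercept = -2.5 is an assumed reading of the constant in (e).
    """
    return 10 ** (log_intercept + slope * (probit(scr_percent) - 5))


def slope_B(b_prime, slope=0.7):
    """Slope of log G.M. on log dose implied by the probit S.C.R. slope b'."""
    return slope * b_prime


print(round(gm_titre_from_scr(50), 4))   # about 0.003 u./ml. at a 50% S.C.R.
```

On this reading the estimate B = 0.9 quoted above would correspond to a probit S.C.R. slope b' of about 1.3, since B = 0.7 b'.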

Manufacturers of diphtheria prophylactics in almost all countries
are obliged to test (or have tested) all material intended for human use.
The tests are carried out in guinea pigs and certain minimal requirements
of antigenicity have to be fulfilled in order that the material be admitted
as sufficiently potent (e.g. British Therapeutic Substances Regulations,
1952, and the National Institutes of Health (U.S.A.) requirements,
1948).



The implication of these requirements is that tests on the guinea 
pig may, with reasonable safety, be used as a substitute for tests on 
children, and in a broad measure this is so (W.H.O. Technical Report 
No. 61). But recently results have come to light which cast some doubt 
on the reliability of the guinea pig in this kind of work. The position 
is made more serious by the proposed adoption of “Standard Antigens” 
(W.H.O. Technical Reports Nos. 36 and 61) which in itself is, of course, 
very desirable. As we have already seen (Jerne & Wood, 1949) one 
cannot express the antigenicity of aluminium hydroxide absorbed 
toxoid in terms of purified toxoid in simple solution, although it may 
be practicable to have “Standard Antigens” against which to
standardise broadly similar types of material.

The discrepancies found between laboratory (guinea pig) data and 
field (child) data are as follows: 


I. Using guinea pigs and comparable doses Barr & Llewellyn-Jones
(1951) found that the value of $\sigma$ following a single injection of A.P.T. was
much greater than that following P.T.A.P., and, in addition, the
geometric mean titre of antitoxin was about six times greater with P.T.A.P.
than with A.P.T. Holt & Bousfield (1949) using year-old children found
that one dose of A.P.T. gave an 87.5% S.C.R. and one dose of P.T.A.P.
a 95-97% S.C.R. If, in the children, the geometric mean titre from
P.T.A.P. were six times greater than that from the A.P.T. and in
addition equation (e) held in both cases, then the S.C.R. from the
A.P.T. would not have exceeded 80%.
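The arithmetic behind this bound can be sketched as follows. Only the slope 0.7 of equation (e) is needed, because the constant term cancels when two preparations are compared; logarithms are to base 10 and the function name is illustrative.

```python
from math import log10
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def predicted_scr(reference_scr, titre_ratio, slope=0.7):
    """S.C.R. (%) expected if the geometric mean titre were `titre_ratio` times
    that of a preparation giving `reference_scr` (%), assuming equation (e)
    holds for both preparations with the same slope."""
    shift = log10(titre_ratio) / slope          # change in probit S.C.R.
    return 100 * _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(reference_scr / 100) + shift)


# A.P.T. assumed to give one sixth of the P.T.A.P. titre, as in the guinea pig.
low = predicted_scr(95, 1 / 6)    # starting from a 95% S.C.R. for P.T.A.P.
high = predicted_scr(97, 1 / 6)   # starting from a 97% S.C.R. for P.T.A.P.
print(round(low), round(high))
```

Both predictions fall below 80%, whereas the S.C.R. actually observed with A.P.T. in the children was 87.5%.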


II. The second discrepancy would seem to be more marked than the 
first. 

When H. pertussis vaccine is added to purified toxoid in solution, 
and comparative antigenicity tests made in guinea pigs it is found that 
the whooping cough vaccine has considerably augmented the response 
to the toxoid component of the mixture, measured by their responses 
to one or two doses (Faragé & Pusztai, 1949; di San’t Agnese, 1949; 
Ungar, 1952). Bousfield & Holt (1953) found that their vaccine in- 
creased the antitoxin responses in guinea pigs some 12-15 fold for a 
single inoculation. In children the same toxoid alone gave a 63% 
S.C.R. and the toxoid-vaccine mixture an 83.8% S.C.R. (Table III).
From equation (e) this increase in S.C.R. indicates an increase of about
threefold in geometric mean titre of antitoxin in the children. If the 
difference in log geometric means which was found in the guinea pig 
data had been directly transferable to children then the S.C.R. would 
have been about 96%. 

All the above difficulties may be avoided and the assessment of the
antigenicity characteristics be accurately determined by specifying
the three constants in equation (d). In practice this means that a part 
of the dose-response curve for the prophylactic under test must be 
measured in children; two dosages having a 5:1 ratio, with 30-50 children 
at each, would probably be adequate. From an examination of the
individual results obtained equation (d) could be completed. The
bleeding of small children for assessment of antibody responses is
becoming increasingly practised today (e.g. Butler, Barr & Glenny,
1954).


TABLE III.


Comparative Antigenicity of Purified Diphtheria Toxoid Alone and Mixed with H. Pertussis Vaccine,
in Guinea Pigs and in Children (Bousfield & Holt, 1953).


GUINEA PIG DATA

                                      Geometric Mean Titres, U/ml.
Dosage                                Single dose      Two doses

Exp. 1.
  (a) 1.4 Lf toxoid                   0.012            0.242
  (b) 1.4 Lf toxoid plus
      400 M. H. pertussis             0.191            5.76
  Ratio (b)/(a)                       16               24

Exp. 2.
  (a) 1.0 Lf toxoid                   0.0036           0.058
  (b) 1.0 Lf toxoid plus
      285 M. H. pertussis             0.042            2.85
  Ratio (b)/(a)                       12               49


CHILD DATA (S.C.R. measured 4 weeks after a single dose)

  (a) 30 Lf toxoid                                   gave 63% S.C.R. (213 cases)
  (b) 30 Lf toxoid plus 10,000 M. H. pertussis       gave 83.8% S.C.R. (213 cases)

Such work need only be done on the “Standards” proposed by the 
B.S. Committee (W.H.O.) and on new prophylactics as they are de- 
veloped. The field-calibrated standards may then more reliably be 
employed in the laboratory for routine work. 




A NOTE ON THE SCHICK NEGATIVE REACTION RATE.* 


Many workers have observed that there is no one clear-cut serum 
antitoxin titre at which all subjects pass from a state of giving a Schick 
positive reaction to a negative one. Nevertheless all investigators agree 
that the higher the mean titre of a group the higher is the negative 
reaction rate in that group (Leach 1935; Parish & Wright 1938; Downie 
et al 1941; Greenberg & Roblin 1949). 

In field trials where large groups of children are employed and the 
Schick test is used as the indicator of prophylactic efficiency, it would 
be of value to be able to translate percentage Schick conversion rate 
into geometric mean antitoxin titre. 

Much of the published work on the relationship between serum
antitoxin titre and the Schick test result is not valid as the reagent used 
(Test Toxin) was not standardised (League of Nations B.S. Report 
1931; British T.S.A. Regulations, 1931) or the test and bleeding were 
not made simultaneously. The quantity of published data is still further 
restricted by the serum antitoxin titres being recorded as having a 
potency greater than or alternatively less than some value. 

The method of using the available data (Table IV) was to calculate 
the geometric mean of the extremes of the titration brackets used by the 
authors, and plot the percent negative reactors in that group against 
that titre. The percent Schick negative reaction rate (S.N.R.R.) 
increased in a sigmoid curve to a 100% asymptote with increase of 
geometric mean. 

When the probit of the percent negative reaction rate was plotted 
against log geometric mean, a straight line could reasonably be drawn 
through the points. The probit line was fitted by the standard method 
(Finney, 1952), and a slope of 1.435 ± 0.159 obtained. The test for
heterogeneity gave $\chi^2$ = 8.667 for 5 d.f. (P = 0.15). The geometric
mean titre corresponding to a 50% S.N.R.R. was estimated to be 0.0032 
U/ml., whence, as a first approximation, the relationship between 
S.N.R.R. (or S.C.R.) and geometric mean titre may be expressed as


$$\log(\text{geometric mean titre}) = \bar{3}.5 + \frac{1}{1.435}\,(\text{probit S.N.R.R.} - 5),$$


which is equivalent to equation (e) above. 
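
As an illustration only, the relationship can be evaluated numerically. The sketch below assumes that the constant is read as $\bar{3}.5$, that is, log10(0.0032) or about -2.5, which follows from the 50% titre of 0.0032 U/ml quoted above; the function names are hypothetical and the normal deviate comes from the Python standard library.

```python
from statistics import NormalDist

SLOPE = 1.435         # fitted probit slope quoted in the text
LOG_GM_AT_50 = -2.5   # assumed reading of the constant: log10(0.0032 U/ml)

_STD_NORMAL = NormalDist()


def probit(percent):
    """Probit (normal deviate + 5) of a percentage strictly between 0 and 100."""
    return 5.0 + _STD_NORMAL.inv_cdf(percent / 100.0)


def log_gm_from_snrr(percent_negative):
    """log10 geometric mean titre implied by a Schick negative reaction rate."""
    return LOG_GM_AT_50 + (probit(percent_negative) - 5.0) / SLOPE


def snrr_from_log_gm(log_gm):
    """Percent Schick negative implied by a log10 geometric mean titre."""
    return 100.0 * _STD_NORMAL.cdf(SLOPE * (log_gm - LOG_GM_AT_50))


# Fold increase in geometric mean titre implied by the child data of Table III
# (63% S.C.R. for toxoid alone, 83.8% for the toxoid-vaccine mixture).
fold = 10 ** (log_gm_from_snrr(83.8) - log_gm_from_snrr(63.0))
print(f"implied fold increase in geometric mean titre: {fold:.1f}")
```

Under these assumptions the implied increase comes out at roughly threefold, in line with the figure given for the children in the discussion of the second discrepancy.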
The data provided by Downie et al. (1941) were found to be com- 
pletely at variance with the remainder. These particular data were 


*The expression ““% S.C.R.” is customarily used to describe the antigenic efficiency of a course 
of active immunization. The expression “Schick Negative Reaction Rate” as used here, is meant to 
focus attention on the physiology of the response to the Schick test as distinct from the immunizing 
reagent. Mathematically these expressions are interchangeable. 


obtained at the height of the secondary response (10 days after a second 
inoculation) whereas the others (apart from data B, 2nd figures) were 
derived from subjects which had received their injection(s) at least 
two months before the tests were made. 

Recently, Kurokawa et al. (1951) reported a marked discrepancy in 
the relationship between circulating antitoxin titre and the Schick test 


TABLE IV.


Collected Data on the Relationship between Geometric Mean Titre of Serum Antitoxin and the Schick
Test Results in Children.


Calculation of the Slope of the Regression Line of Probit S.N.R.R. on log G.M. and the value of
$\chi^2$ for Heterogeneity.


Origin*         Log G.M.    Cases      No. -ve     % -ve     Probit
                Titre                                        % -ve

A2              2.35        6          6           100       ∞

A1              3.24        57         50          87.7
C                           22         21          95.4
A1 + C                      79         71          89.9      6.28

B               2.15        9 + 20     7 + 15      75.9      5.70

D               2.00        27         25          92.6      6.45

A1              3.85        16         9           56.2
B                           5 + 17     3 + 5       36.4
C                           11         11          100
A1 + B + C                  49         28          57.1      5.18

D               3.60        25         24          96        6.75

A2              3.50        17         13          76.5
B                           16 + 17    10 + 3      39.4
A2 + B                      50         26          52        5.05

D               3.30        22         19          86.4      6.1

A1              3.20        26         10          38.5
C                           21         14          66.7
A1 + C                      47         24          51        5.02

D               3.0         12         7           58.3      5.21

A2              3.0         48         11          22.9
B                           20 + 16    5 + 3       22.2
A2 + B                      84         19          22.6      4.25


Statistical Analysis

1) Slope = 1.435 ± 0.159; $\chi^2_{(5)}$ = 8.667, P = 0.15 (Data D excluded).

2) $\sigma$ for group
   A1   = 0.44      B(2) = 0.43
   A2   = 0.40      C    = 0.46
   B(1) = 0.44      D    = 0.35
   Mean $\sigma$ = 0.42.


*A. Parish, H. J. and Wright, J. (1938)
      1 = Table 1.
      2 = Table 2.

 B. Valquist, B. and Hogstedt, C. (1949)
      first numbers = active immunization, Table 3
      second numbers = passive immunization, Table 4
      demarcation line 7 mm.

 C. Leach, C. N. and Poch, G. (1935), Table 1.

 D. Downie, A. W., Glenny, A. T., Parish, H. J., Wilson Smith and Wilson,
    G. S. (1941) Table XII.


result in guinea pigs in which (a) the serum antitoxin titre was rapidly 
rising and (b) where it was uniform. These observations may account 
for the anomalous data of Downie et al. 

The finding that there is a linear relationship between probit
S.N.R.R. and log geometric mean titre reveals the fact that there is
another variable operating, with a distribution of standard deviation
$\sigma_p$. The reciprocal of the slope of the regression line of probit
S.N.R.R. on log geometric mean titre must, therefore, be the resultant
of this other variable and $\sigma$ in the groups examined. Presumably
$\sigma_p$ is reasonably constant and represents
the variation in human skin capillary permeability. The groups of
subjects examined all have very much the same value of $\sigma$ (Table IV),
with a mean of 0.42. Assuming that the skin capillary variable is also
lognormally distributed and is independent of the antitoxin titre, then
the reciprocal of the slope of the regression line of probit S.N.R.R. on
log geometric mean titre may be expressed as


$$0.7 = \sqrt{\sigma^2 + \sigma_p^2}$$


and since $\sigma$ is, approximately, 0.42 the value of $\sigma_p$ is about 0.56.
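
The arithmetic behind this figure can be checked in a few lines. The sketch below assumes only the values quoted in the text (slope 1.435 and a mean within-group standard deviation of 0.42) and treats the two sources of variation as independent, as stated above; the variable names are illustrative.

```python
import math

reciprocal_slope = 1 / 1.435   # about 0.7, the combined standard deviation
sigma = 0.42                   # mean standard deviation of log titres (Table IV)

# Independent components add in variance, so the skin-permeability component
# is the square root of the difference of the squares.
sigma_p = math.sqrt(reciprocal_slope ** 2 - sigma ** 2)
print(f"reciprocal slope = {reciprocal_slope:.3f}, sigma_p = {sigma_p:.2f}")
```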


SUMMARY 


In view of the known gross differences in response from different 
forms of diphtheria prophylactic an attempt has been made to character- 
ise the antigenicity of any type in mathematical terms. 

Use is made of the observation that the responses among a group 
of similar subjects, identically treated, are lognormally distributed; as
well as the observation that when the probit of the Schick conversion 
rate is plotted against log dose a straight line is obtained. 

It is found that three variables are involved, namely d, the dose 
required to effect some arbitrary measure of response; b, the rate of
increase, with respect to log dose, of the probit of the percentage of
titres exceeding the arbitrary level; and $\sigma$, the standard deviation of
logs of titres. In addition, it is found that the product of b and $\sigma$ is
equal to B, the slope of the line relating the log geometric mean to log 
dose. 

The fact that $\sigma$ may differ between different types of prophylactic
signifies that neither the comparison of geometric means nor the
determination of b can provide a strictly accurate characterization of
antigenicity; this has direct relevance to the use of standard or Reference 
Preparations for routine laboratory purposes. 

It is suggested that the values of d, b and $\sigma$ should be determined,
in children, for each type of prophylactic, and in view of serious dis- 
crepancies between laboratory and field data that Standard Antigens 
should be calibrated in the field before adoption in the laboratory. 

An examination of selected published data on the relationship 
between serum antitoxin titre and Schick test result is made. From 
these data a first approximation of the relationship between percent 
Schick negative and geometric mean titre of antitoxin has been derived, 
which may be expressed as 


$$\log \text{G.M.} = \bar{3}.5 + 0.7\,(\text{probit percent negative} - 5).$$


The value of 0.7 for the reciprocal of the slope of the probit regression
line appears to be the resultant of two constants, $\sigma$ the standard
deviation of logs of titres and $\sigma_p$ the standard deviation of the
distribution of skin capillary permeability in children, in that


$$0.7 = \sqrt{\sigma^2 + \sigma_p^2}.$$


It is suspected that $\sigma_p$ is relatively constant and that $\sigma$ varies with the
type of diphtheria prophylactic used.


Grateful thanks are due to Dr. P. Armitage of the Statistical Re- 
search Unit, London School of Hygiene and Tropical Medicine, for his 
most valuable help and criticism of this work. 


REFERENCES 


Barr, M. (1950) Brit. J. Exp. Path., 31, 615.

Barr, M., Glenny, A. T., and Randall, K. J. (1950) Lancet, i, 6. Table I. Group
A+B+C.

Barr, M. and Llewellyn-Jones, M. (1951) Brit. J. Exp. Path., 32, 231.

Butler, N. R., Barr, M. and Glenny, A. T. (1954) B. M. J., i, 476.

Bousfield, G. and Holt, L. B. (1950) B. M. J., ii, 73.




Bousfield, G. and Holt, L. B. (1953) VI International Congress of Microbiology, 
Rome. 

Carlinfanti, E. (1948) J. Immunol., 59, 1, and Lancet, ii, 629.

Downie, A. W., Glenny, A. T., Parish, H. J., Wilson Smith & Wilson, G. S. (1941)
B. M. J., ii, 717.

Faragé, F. and Pusztai, S. (1949) Brit. J. Exp. Path., 30, 572.

Finney, D. J. (1952) “Probit Analysis” 2nd Edition. Camb. Univ. Press.

Fisher, R. A. and Yates, F. (1948) “Statistical Tables” Table 1. Oliver & Boyd.

Greenberg, L. and Roblin, M. (1949) Canad. J. Public Hlth. March. 112.

Hazen, A. (1914) Trans. Amer. Soc. Civ. Engrs. 77, 1539.

Holt, L. B. (1947) Lancet, i, 286.

Holt, L. B. (1950) “Developments in Diphtheria Prophylaxis” p. 154. Heinemann.

Holt, L. B. (1951) Brit. J. Exp. Path., 32, 157.

Holt, L. B. and Bousfield, G. (1949) B. M. J., i, 695.

Jerne, N. K. and Maaløe, O. (1949) Bull. World Hlth. Org., 2, 49.

Jerne, N. K. and Wood, E. C. (1949) Biometrics, 5, 273.

Kurokawa, M., Murta, R., Nakano, T., Yamada, T. and Kubata, K. (1951) Jap.
Med. J., 4, 369. Table 2.

Leach, C. N. and Poch, G. (1935) J. Immunol., 29, 367.

Parish, H. J. and Wright, J. (1938) Lancet, i, 882.

Prigge, R. (1953) Bull. Wld. Hlth. Org., 9, 843.

di San’t Agnese, P. A. (1949) Pediatrics, 3, 20.

Ungar, J. (1952) Proc. Roy. Soc. Med.,'10, 674.

Valquist, B. and Hogstedt, C. (1949) J. Immunol., 62, 277.

Whipple, G. C. (1916) J. Franklin Inst., 182, 37 and 205.


PREDICTION EQUATIONS IN QUANTITATIVE GENETICS 
ALAN ROBERTSON 


Institute of Animal Genetics, Edinburgh. 


Member of the Scientific Staff of the 
Agricultural Research Council, Great Britain. 


One of the fundamental concepts in the application of statistical
methods to the analysis of the inheritance of characters showing
continuous variation is the additive genetic variance $\sigma_G^2$, the variance
in any character in a population that is due to the average effects of
genes. If this is expressed as a fraction of the total variance, $\sigma_P^2$, we
get the related parameter, $h^2$, the heritability (in the narrowest sense)
of the character. It can be shown without great labour that the
heritability is also equal to the regression coefficient of breeding value on
performance or phenotype. This short paper presents an alternative
derivation from the point of view of the combination of information
from different sources, an approach which may be useful in teaching.
Several other important prediction equations in quantitative genetics
can be fitted into the same pattern.

If we have a measurement $P$ of an individual in a population in which
the character measured has a mean $\bar{P}$, we may consider ourselves as
having two independent pieces of information on the animal’s breeding
value. They are: (i) that the animal is a member of a population whose
mean breeding value is $\bar{P}$ with variance $\sigma_G^2$; (ii) that the animal’s own
performance is $P$ and that this will have variance $\sigma_P^2 - \sigma_G^2$ about the
true breeding value.

If we knew only (i), we should take $\bar{P}$ as the best estimate of the
individual’s breeding value and it would have error variance $\sigma_G^2$. If we knew
only (ii), we should take $P$ as the best estimate with error variance
$\sigma_P^2 - \sigma_G^2$.

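In combination, the correct weight to give the two estimates is the
reciprocal of their respective variances. We then have, for the estimate
$\hat{G}$ of the breeding value,


$$
\begin{aligned}
\hat{G} &= \frac{\dfrac{\bar{P}}{\sigma_G^2} + \dfrac{P}{\sigma_P^2 - \sigma_G^2}}{\dfrac{1}{\sigma_G^2} + \dfrac{1}{\sigma_P^2 - \sigma_G^2}} \\
&= \frac{\bar{P}(\sigma_P^2 - \sigma_G^2) + P\sigma_G^2}{\sigma_P^2} \\
&= \bar{P} + (P - \bar{P})\,\frac{\sigma_G^2}{\sigma_P^2} \\
&= \bar{P} + h^2(P - \bar{P})
\end{aligned}
$$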

which is the usual regression formula. This derivation shows clearly 
the premises on which the formula is based. 
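
A small numerical sketch of this argument follows. It weights the two estimates by the reciprocals of their variances and compares the result with the regression formula; the function names and the example figures (population mean 100, phenotype 120, genetic variance 25, phenotypic variance 100) are illustrative assumptions, not values from the paper.

```python
def combine(est_a, var_a, est_b, var_b):
    """Weight two independent estimates by the reciprocals of their variances."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    return (w_a * est_a + w_b * est_b) / (w_a + w_b)


def breeding_value_by_weighting(p, pop_mean, var_g, var_p):
    """Combine (i) the population mean and (ii) the animal's own performance."""
    return combine(pop_mean, var_g, p, var_p - var_g)


def breeding_value_by_regression(p, pop_mean, var_g, var_p):
    """The usual regression form, with heritability h2 = var_g / var_p."""
    h2 = var_g / var_p
    return pop_mean + h2 * (p - pop_mean)


weighted = breeding_value_by_weighting(120.0, 100.0, 25.0, 100.0)
regressed = breeding_value_by_regression(120.0, 100.0, 25.0, 100.0)
print(weighted, regressed)
```

With these figures the heritability is 0.25, so both routes regress the deviation of 20 units back to 5 units above the population mean.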

We may now extend the scope of this presentation so that we can 
easily deal with other prediction formulae. The general situation is 
that we are interested in a primary variable whose probable value we 
wish to predict (in the previous case the breeding value), which is 
obscured by some secondary variation, which may itself be partly 
genetic in origin. The regression coefficient in the prediction formula 
is just the fraction which the real variance of the primary variable 
makes up of the total. 

As a first example, we may wish to evaluate the breeding value of a 
series of males by a progeny test in which there is no environmental 
variance common to members of a progeny group, a condition which 
is perhaps not often fulfilled. The primary variance between groups
due to sires is $\sigma_G^2/4$ while the secondary obscuring variation is due to
sampling within groups and is equal to $[\sigma_P^2 - (\sigma_G^2/4)]/n$ where $n$ is the
number of offspring in the group. The regression coefficient in the
prediction of the true value of progeny of this sire from the observed
mean value of his $n$ offspring is


$$\frac{\dfrac{\sigma_G^2}{4}}{\dfrac{\sigma_G^2}{4} + \dfrac{\sigma_P^2 - \dfrac{\sigma_G^2}{4}}{n}}$$


which on manipulation becomes the accustomed formula

$$\frac{\dfrac{n}{4}h^2}{1 + \dfrac{n-1}{4}h^2}.$$
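
The manipulation can be checked numerically. The sketch below computes the regression coefficient once from the primary and secondary variances and once from the heritability form; the function names and the example values (10 offspring, heritability 0.4, phenotypic variance scaled to 1) are assumptions made for illustration.

```python
def progeny_regression_from_variances(n, var_g, var_p):
    """Primary variance over primary plus secondary (sampling) variance."""
    primary = var_g / 4.0
    secondary = (var_p - var_g / 4.0) / n
    return primary / (primary + secondary)


def progeny_regression_from_heritability(n, h2):
    """The same coefficient written in terms of n and the heritability h2."""
    return (n * h2 / 4.0) / (1.0 + (n - 1) * h2 / 4.0)


b_variances = progeny_regression_from_variances(10, 0.4, 1.0)
b_heritability = progeny_regression_from_heritability(10, 0.4)
print(b_variances, b_heritability)
```

As the number of offspring grows the coefficient approaches 1, since the sampling variance of the progeny mean shrinks while the variance between sires stays fixed.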


The information put into the prediction is (i) that the sire belongs to a 
given breed in which the genetic variance between progeny groups is 


PREDICTION EQUATIONS 97 


$\sigma_G^2/4$; (ii) that the observed average of his progeny has sampling variance
$[\sigma_P^2 - (\sigma_G^2/4)]/n$.

The same formula would apply if we wished to predict the breeding 
value of another member of the progeny group whose own performance 
had not been included in the group average i.e. in family selection. We 
may cast the problem of family selection more generally as follows. 
Suppose we are dealing with a population made up of families of average 
relationship r in which the observed phenotypic correlation between 
relatives is t (see Lush, 1947). In other words, the genetic and pheno- 
typic components of variance within and between groups are 


               Between groups      Within groups

Phenotypic     $t\sigma_P^2$       $(1 - t)\sigma_P^2$
Genetic        $r\sigma_G^2$       $(1 - r)\sigma_G^2$


Ignoring for the moment selection within families, we can take first 
the situation where the animal chosen is not itself measured, as for 
example when a cockerel is chosen on the egg production of his sisters, 
the members measured being considered only as representatives of the 
family. The regression is then given by 


$$\frac{r\sigma_G^2}{t\sigma_P^2 + \dfrac{(1 - t)\sigma_P^2}{n}} = \frac{nrh^2}{1 + (n - 1)t}$$


If there is no environmental similarity between family members, we 
can write $t = rh^2$ and the formula becomes similar to that discussed
above for progeny testing. If, on the other hand, we are choosing an 
individual whose measurement is included in the family average, we 
are interested in these actual groups of n relatives and the sampling 
contribution of the genetic variance must be included in the primary 
variance. Then we have for the regression coefficient 


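$$\frac{r\sigma_G^2 + \dfrac{(1 - r)\sigma_G^2}{n}}{t\sigma_P^2 + \dfrac{(1 - t)\sigma_P^2}{n}} = h^2\,\frac{1 + (n - 1)r}{1 + (n - 1)t}$$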


If we wish to take into account also the animal’s own phenotype,
it is simpler to use the two quantities $P - \bar{F}$, the deviation of the
individual from the family mean, and $\bar{F} - \bar{P}$, the deviation of the
family mean from the population mean. These have the advantage
that they are statistically independent and knowledge of one tells
nothing about the other. We can then simply add together the
predictions from the two variables. The regression coefficient of breeding
value on family mean we have just obtained. As we are dealing with
deviations from the observed mean, the effective genetic variance within
families is $(1 - r)\sigma_G^2(n - 1)/n$ and the phenotypic $(1 - t)\sigma_P^2(n - 1)/n$.
The regression coefficient for $P - \bar{F}$ is then $h^2(1 - r)/(1 - t)$. The full
equation reads


$$\hat{G} = \bar{P} + h^2\,\frac{1 + (n - 1)r}{1 + (n - 1)t}\,(\bar{F} - \bar{P}) + h^2\,\frac{1 - r}{1 - t}\,(P - \bar{F})$$


as given by Lush. This derivation of the basic equation of family
selection is more congenial to the author than that by path coefficients
and is presented as an alternative which may be of value to those
learning the subject.
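
As a sketch of how the two regressions combine, the function below adds the prediction from the family-mean deviation to the prediction from the within-family deviation. The function and argument names are hypothetical, and the coefficients are those derived above: h²[1 + (n - 1)r]/[1 + (n - 1)t] for the family mean and h²(1 - r)/(1 - t) for the individual's deviation from it.

```python
def family_coefficients(n, r, t, h2):
    """Regression coefficients for the family-mean and within-family deviations."""
    b_family = h2 * (1 + (n - 1) * r) / (1 + (n - 1) * t)
    b_within = h2 * (1 - r) / (1 - t)
    return b_family, b_within


def predict_breeding_value(p, family_mean, pop_mean, n, r, t, h2):
    """Add the predictions from the two statistically independent deviations."""
    b_family, b_within = family_coefficients(n, r, t, h2)
    return (pop_mean
            + b_family * (family_mean - pop_mean)
            + b_within * (p - family_mean))


# With a "family" of one the family mean is the individual itself, and the
# prediction reduces to the simple regression on the animal's own phenotype.
single = predict_breeding_value(120.0, 120.0, 100.0, n=1, r=0.5, t=0.2, h2=0.25)
print(single)
```

A second check on the form is the case r = t, where both coefficients equal h² and the family structure adds no information beyond the individual's own record.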


Summary 
The basic prediction equation of quantitative genetics (that of
breeding value on performance) is derived from the point of view of the 
combination of information from different sources. The principle is 
extended to several other prediction equations in family selection and 
progeny testing. 
ACKNOWLEDGEMENT 


The author gratefully acknowledges Professor J. L. Lush’s comments 
on the manuscript. 
REFERENCES 


Lush, J. L. (1947). Family merit and individual merit as bases for selection. Amer. 
Nat. 81, 241-261 and 362-379.


DETERMINING THE FRUIT COUNT ON A TREE BY 
RANDOMIZED BRANCH SAMPLING* 


RAYMOND J. JESSEN


Iowa State College 
Ames, Iowa 


In crop estimation work and in some areas of biological and pomo- 
logical research, the problem of determining the total number of fruits 
on a tree sometimes arises. If an accurate count of all fruits is attempted, 
this may be quite an onerous and time-consuming job, especially if
the fruits are to be left on the tree undamaged. If the fruits are picked 
before counting, in order to improve the accuracy of the results, the 
removal of the fruits may seriously interfere with other aspects of the 
investigation. A method of obtaining reasonably precise estimates 
of the total fruits by sampling, so that little time is required, may be 
of some interest. The purpose of this paper is to describe some possible 
schemes and compare some aspects of their efficiencies. 

The object of sampling is to select some portion of a relatively 
large total which will represent that total reasonably well. In the 
present case, the object is to select a few of the many smaller branches 
of a tree in such a manner that counting the fruits on these sample 
branches will enable us to obtain a reasonably accurate estimate of 
the total fruits on the tree. At present, we shall consider only those 
schemes which select the sample branches by a randomizing procedure. 

Suppose the branching system of a tree is represented as in the 
following diagram: 

The trunk, branch number “0”, splits into two branches at fork I. 
Branch 1 of this fork splits into 3 branches at fork II, etc. Suppose 
all the fruits of the tree are borne on the peripheral branches, the 
number being indicated by the encircled figures. Thus branch 1 of 
fork III has 12 fruits, branch 1 of fork VI has none, etc.; the tree has 
64 fruits borne on its 8 “fruiting” branches. 

Suppose we wish to determine the fruit count of this tree by con- 
fining our counts to two fruiting branches selected at random. This 
could be done by numbering each of the 8 fruiting branches from 1 to 


*Journal Paper No. J-2547 of the Iowa Agricultural Experiment Station, Ames, Iowa. Project 
No. 1005. 




8, choosing two random digits from 1 to 8, say 8 and 4, and taking those 
designated branches (Nos. 8 and 4) for the sample. If counts are made 
on those two branches, we can obtain an average fruit count per branch, 
which when multiplied by the total number of fruiting branches, 8, 
provides an estimate of the total in the tree. If, in our case, serial
number “8” refers to branch 2 of fork VI and serial number “4” is
branch 2 of fork IV, we obtain the counts 15 and 5, an average of 10.0, 
or an estimated total for the tree of 8 X 10.0 = 80. 

The above scheme is simple and, if counts are accurately made, will 
provide unbiased estimates of the total count. It may, however, be 
quite laborious to identify and number all the fruiting branches on a 
tree, such as this scheme requires, not only to provide a means for 
randomizing the selection of the branches but also to provide a means 
to estimate the total for the sample. 
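
The unbiasedness of this simple scheme can be checked by enumerating every possible pair of branches. The sketch below uses the fruit counts of the example tree; the order in which they are listed is arbitrary, and the function name is an assumption for illustration.

```python
from itertools import combinations

# Fruit counts on the 8 fruiting branches of the example tree (order arbitrary).
FRUIT_COUNTS = [8, 8, 12, 5, 6, 10, 0, 15]


def srs_estimate(sample_counts, n_branches):
    """Average count per sampled branch times the number of fruiting branches."""
    return n_branches * sum(sample_counts) / len(sample_counts)


# The worked example in the text: branches with 15 and 5 fruits.
example = srs_estimate([15, 5], len(FRUIT_COUNTS))

# Every pair is equally likely, so the expected estimate is the plain mean
# of the estimates over all pairs.
all_estimates = [srs_estimate(pair, len(FRUIT_COUNTS))
                 for pair in combinations(FRUIT_COUNTS, 2)]
mean_estimate = sum(all_estimates) / len(all_estimates)
print(example, mean_estimate)
```

The mean over all pairs equals the true total of 64, although any single pair, such as the one in the worked example, may miss it.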

In order to avoid the problem of complete branch identification 
and numbering and still obtain unbiased estimates of the total fruit 
count, the following scheme is proposed. Let us take a position at 
fork I and decide by a random draw of a 1 or 2 whether to follow branch 
1 or branch 2. Suppose 2 is drawn. We proceed up branch 2 to fork 
III and since there are two possible branches we draw another 1 or 2 
at random, say 2 is drawn. Proceeding up to fork IV, suppose we draw 
another 2 at random which puts us at branch 2, our sample branch 
which must be counted. Obtaining the count of 5 fruits, we must now 
estimate the total on the tree, which is done as follows: 


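[Diagram: branching system of the example tree, showing forks I to VI, the branch numbers at each fork, and the encircled fruit counts on the 8 fruiting branches.]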

RANDOMIZED BRANCH SAMPLING 


$$\text{Estimated total} = \frac{5}{1/2 \times 1/2 \times 1/2} = 40.$$

The denominator of this estimator may be regarded as an estimate of 
the fraction of all fruiting branches that this particular sample branch 
represents. If two sample branches are desired, the above procedure 
can be repeated with new random draws. (If the same branch is 
selected, just repeat the second series of draws.) The estimate from 
the second branch is obtained in a manner identical to the first, and 
the best pooled estimate is simply the average of these two. For 
example, suppose on the second series we obtain branch 1 of fork V. 
The estimated total is given by 


$$\frac{6}{1/2 \times 1/2 \times 1/2 \times 1/2} = 96$$


and the pooled estimate of the two sample branches is therefore 
(1/2)(40 + 96) = 68 


Although, in this example, the two-branch estimate, 68, is quite 
close to the true count, 64, it is more or less fortuitous. If all possible 
one-branch estimates are examined we obtain the following: 


Fork and Branch No.     Branch Count     Estimate

II-1                    8                48
II-3                    8                48
III-1                   12               48
IV-2                    5                40
V-1                     6                96
V-2                     10               160
VI-1                    0                0
VI-2                    15               180


It can be seen that our single branch estimates vary widely (from
0 to 180) depending on the particular branch selected. This undesirable 
characteristic of this method of sampling might be reduced somewhat 
by taking branch size into account in the scheme for selecting branches. 
Another alternative is to count a group of fruiting branches. These 
and other possible procedures for increasing the precision of the estimates 
for a given fraction of fruits counted will be dealt with later in this 
paper. 

It may be of interest to test the unbiasedness of this method of 
estimating total fruits from branch samples. By unbiasedness is 
generally meant that the average of the estimates over all possible 
samples will be identical to the number being estimated. In the table
above we have the 8 possible estimates from single branch samples.
Since the probability of obtaining a particular branch in a sample is 
not the same for all branches, we cannot take a simple mean of these 8 
values as the average estimate of this method of sampling. A weighted 
average is required. The data for each of the 8 branches, the estimates 
obtained from each and the probability of obtaining each are: 


Branch:        II-1    II-3    III-1    IV-2    V-1     V-2     VI-1    VI-2
Estimate:      48      48      48       40      96      160     0       180
Probability:   8/48    8/48    12/48    6/48    3/48    3/48    4/48    4/48


Weighting each estimate by its probability of occurrence we obtain as 
the weighted average, 64, which is identical to the true total being 
estimated. This scheme of sampling and estimating is therefore re- 
garded as unbiased. 
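
The whole procedure can be sketched on the example tree. The nested-list representation below is an assumption made for illustration (each list is a fork, each integer a fruiting branch with its count), reconstructed from the fork and branch descriptions in the text; the code lists every terminal branch with its selection probability and estimate, and also performs one random walk up the tree.

```python
import random
from fractions import Fraction

# Fork I -> [branch 1 (fork II), branch 2 (fork III)]
#   fork II  -> [8, fork VI, 8];  fork VI -> [0, 15]
#   fork III -> [12, fork IV];    fork IV -> [fork V, 5];  fork V -> [6, 10]
TREE = [[8, [0, 15], 8], [12, [[6, 10], 5]]]


def leaves(node, prob=Fraction(1)):
    """Yield (fruit count, selection probability) for every fruiting branch."""
    if isinstance(node, int):
        yield node, prob
    else:
        for child in node:
            # Equal chance for each branch at a fork.
            yield from leaves(child, prob / len(node))


def sample_estimate(node, rng):
    """Walk up the tree by random draws and return count / probability."""
    prob = Fraction(1)
    while not isinstance(node, int):
        prob /= len(node)
        node = rng.choice(node)
    return node / prob


table = [(count, prob, count / prob) for count, prob in leaves(TREE)]
expected = sum(prob * estimate for _, prob, estimate in table)
print(expected)
print(sample_estimate(TREE, random.Random(0)))
```

Exact fractions are used so that the weighted average reproduces the true total without rounding error.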

In order to provide an elementary test of the practicality of this 
scheme and to investigate the effects of certain modifications which 
seemed of interest, complete data were obtained from an orange tree. 
The tree, a pineapple orange approximately 25 years old, was situated 
in a Florida Citrus Experiment Station’s experimental grove at Lake 
Alfred, Florida.* The counts were made September 23 and 24, 1953. 
The number of fruits borne on each of the branches was counted and 
recorded. The circumference of each branch (except the smaller ones) 
was measured near the fork of origin and also recorded. A total of 
1379 fruits was counted. The results of the branch counts and measure- 
ments are shown in Figure 1. 

The data collected in this manner provided the means for testing 
the efficiency of a number of alternative ways of selecting the sample 
branches. For example, what is the best basis for determining the 
probabilities with which to draw a branch at a given fork: (I) equal 
for each branch, (II) proportional to the number of branches into 
which each of these branches divide or (III) proportional to the cross- 
sectional area of each branch? To examine this question the 5 main 
branches were considered. It can be seen from Figure 1 that the trunk 
divides into two branches, say I and II, with circumference measure- 
ments 21 5/8” and 23 1/2” respectively, and I divides into two further 
branches, A with a circumference of 15” and B with 11 7/8”; branch 
II on the other hand divides into 3 branches, A, B and C, with cir- 
cumferences of 10 3/4”, 17 3/8” and 14 1/2” respectively. The three 
bases for determining probabilities are described as follows: 


*John W. Sites granted permission for making the count.


FIGURE 1

[Figure 1: diagram of the orange tree showing the fruit count and circumference recorded for each branch.]


PE (probabilities equal). Since in this case there are 5 branches, 
each will have a probability of 1/5. 

PPN (probabilities proportional to “number”). With this scheme
the probability of obtaining each main branch is made equal at
each fork. In this case, since there are 2 branches at the first
fork, the probability of each is 1/2. Applying the same principle
at each of the subsequent forks, we obtain as the overall
probability of getting branch IA, 1/2 × 1/2 = 1/4; for IB, 1/2 × 1/2 =
1/4; for IIA, 1/2 × 1/3 = 1/6; etc.

PPA (probabilities proportional to “area”). As a measure of the
cross-sectional area of a branch, the square of its circumference
will be used. This scheme provides at any fork that a large branch
will have a greater chance of selection than a small branch. The
following calculations are required for the first fork:


Branch:                          I           II          Totals
Circumference:                   21 5/8”     23 1/2”
Circumference squared:           467.64      552.25      1019.89
Fraction of total, or prob.:     .46         .54         1.00


When similar calculations are carried out for the forks at the ends of 
these branches, we obtain as the final PPA, probabilities for the 5 
branches: 


Branches at first fork:                      I                  II
Branch probabilities at first fork:         .46                .54

Branches at second fork:                  A      B        A      B      C
Branch probabilities at second fork:     .61    .39      .18    .48    .34
Overall probabilities:                   .28    .18      .10    .26    .18
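
The PPA probabilities above can be reproduced from the circumference measurements quoted in the text. The sketch below is illustrative only; the function name and the dictionary layout are assumptions.

```python
def ppa_probabilities(circumferences):
    """Selection probabilities proportional to the squared circumference."""
    squares = {name: c ** 2 for name, c in circumferences.items()}
    total = sum(squares.values())
    return {name: square / total for name, square in squares.items()}


# Circumferences in inches, as quoted in the text.
first_fork = ppa_probabilities({"I": 21 + 5 / 8, "II": 23 + 1 / 2})
second_forks = {
    "I": ppa_probabilities({"A": 15.0, "B": 11 + 7 / 8}),
    "II": ppa_probabilities({"A": 10 + 3 / 4, "B": 17 + 3 / 8, "C": 14 + 1 / 2}),
}

# Overall probability of a main branch: product of the probabilities along its path.
overall = {
    main + sub: first_fork[main] * prob
    for main, subs in second_forks.items()
    for sub, prob in subs.items()
}
for name, prob in overall.items():
    print(name, round(prob, 2))
```

Rounded to two decimals these agree with the tabulated overall probabilities for the five main branches.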


The extension of the foregoing procedures to the determination of
selection probabilities for each scheme to any and all branches on the
tree can be seen. For evaluating the effectiveness of the three procedures
for selecting sample branches, we shall compare the variabilities of the
estimates of total fruits in the tree obtained from each, since the estimates
will be made by the formula


$$\hat{X} = \frac{x}{P}$$


where $\hat{X}$ is the estimated number of fruits on the tree,
$x$ is the actual number of fruits on a sample branch,
$P$ is the probability of selecting the sample branch.


As a measure of the precision of the estimates, we may use the
standard error of $\hat{X}$ or its square, the variance, which for samples of
size one, is given by the formula:


$$V(\hat{X}) = \sum_{i=1}^{N} P_i\,(\hat{X}_i - X)^2$$


where $\hat{X}_i$ is the estimate obtained from one of the $N$ different sample
branches,
$X$ is the true number of fruits, the quantity being estimated,
$P_i$ is the probability of selecting the branch from which a
particular estimate is made.

The variances corresponding to each of the three bases for selecting 
branches are shown here for the case where only the 5 main branches 
of the tree are regarded as sample branches. The relevant data are 
given in Table 1. 


TABLE 1.


Data and comparisons of reliability of the three methods of sampling the 5 main
branches of a tree.


Branch designation                  IA       IB       IIA      IIB      IIC      Totals

Branch serial no., i                1        2        3        4        5        5
No. of fruits, x_i                  476      162      85       441      215      1379
Prob. of selection, P_i; PE         1/5      1/5      1/5      1/5      1/5      1.000
Prob. of selection, P_i; PPN        1/4      1/4      1/6      1/6      1/6      1.000
Prob. of selection, P_i; PPA        .28      .18      .10      .26      .18      1.000
Estimates, X_i; PE                  2380     810      425      2205     1075
Estimates, X_i; PPN                 1904     648      510      2646     1290
Estimates, X_i; PPA                 1696     903      874      1701     1171
Variance, V(X); PE                                                               602,115
Variance, V(X); PPN                                                              597,224
Variance, V(X); PPA                                                              128,545

In this case, the PPA method gave the greatest reliability, a variance
of 128,545 as compared with 597,224 for PPN and 602,115 for PE,
equal probability. Thus, it can be said that the efficiency of PPA
relative to PE is 468% (= 602,115/128,545 × 100), a very clear
superiority indeed.
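
The comparison can be retraced with the single-branch variance formula, the sum over branches of P_i (x_i / P_i - X)². The sketch below uses the fruit counts and selection probabilities of Table 1; the names are illustrative. Because the PPA probabilities are entered to two decimals only, the PPA variance computed here is close to, but not identical with, the published 128,545.

```python
FRUITS = [476, 162, 85, 441, 215]   # branches IA, IB, IIA, IIB, IIC
TOTAL = sum(FRUITS)                 # 1379 fruits on the tree

SCHEMES = {
    "PE": [1 / 5] * 5,
    "PPN": [1 / 4, 1 / 4, 1 / 6, 1 / 6, 1 / 6],
    "PPA": [0.28, 0.18, 0.10, 0.26, 0.18],   # two-decimal values from Table 1
}


def variance_single_branch(fruits, probs):
    """Variance of the estimate x / P when one branch is drawn with probability P."""
    total = sum(fruits)
    return sum(p * (x / p - total) ** 2 for x, p in zip(fruits, probs))


variances = {name: variance_single_branch(FRUITS, probs)
             for name, probs in SCHEMES.items()}
for name, value in variances.items():
    print(f"{name}: {value:,.0f}")

# Efficiency of PPA relative to PE, in percent.
relative_efficiency = 100 * variances["PE"] / variances["PPA"]
print(f"PPA relative to PE: {relative_efficiency:.0f}%")
```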



The use of large branches such as these does not appear to be as 
generally practical for sampling as a smaller branch. By means of the 
same procedure given above, a comparison can be made of the respective 
efficiencies of branches of different sizes. It will be convenient to refer 
to “size” of branches by the average number of fruits on them. Thus, 
branches of two sizes, averaging about 17 and 25 fruits, will be compared 
for sampling efficiency, with branches averaging 276 fruits (the 5 main 
branches). The basic figures required for this comparison are the 
variances of estimates per branch made by the several methods. These 
figures are shown in Table 2. 


TABLE 2.


Variances of estimates of total fruits on the tree for each of three sizes of sample branches and three
selection schemes.


                          Size of branch      Variance of X, estimated total fruits,
                          (average number     per branch, by selection scheme:
Branch description        of fruits per       PE                PPN               PPA
and total on tree         branch)             (Probability      (Prob. prop.      (Prob. prop.
                                              equal)            to No.)           to area)

5 main branches           275.8               602,114           596,957           128,530
55 smaller branches       25.1                1,404,299         19,236,106        1,710,941
80 smallest branches      17.3                1,932,119         19,648,726        1,818,344


The variances in Table 2 were computed by the simple formula:

$$\sigma^2 = \sum_{i=1}^{N} P_i \left( \frac{x_i}{P_i} - X \right)^2$$

where the tree has N possible sample branches, on some branch i the
number of fruits is $x_i$, $P_i$ is the probability of selecting the ith branch
and X is the total number of fruits on the tree, the quantity being
estimated. In order to compare the efficiency of the several methods, 
it is necessary to put them on a comparable basis. For example the 
variance of X when a 25.1 fruit branch is used is 1.4 × 10^6, and for a
17.3 fruit branch, it is 1.9 × 10^6, suggesting that the larger branch is
more efficient for sampling. However, on the average, we must count
25.1/17.3 or 1.45 times as many fruits with the larger branch. A comparison
of the efficiencies of the two branch sizes can be made if the variances 
in Table 2 are put on a per fruit basis. This is equivalent to a com- 
parison on the variances of the two schemes when the total number of 
fruits counted with each scheme is the same. The variances in Table 2
are put on a per fruit basis by multiplying each variance by the average
number of fruit in the branch, K. Thus:

$$\sigma^2 \text{ (per fruit)} = K \sigma^2 \text{ (per branch)}.$$


In Table 3 are shown the variances on a per fruit basis and the 
corresponding relative efficiencies of the several schemes with the 17.3 
fruit branch with equal probabilities of selection taken as a base. 


TABLE 3.

Variances per fruit of estimates of total fruits on the tree for the several methods of sampling and the
relative efficiencies of each.

Size of branch      Variance of X, per fruit,      Relative efficiency of method;
(average number     by selection scheme            small branch with equal
of fruits per       (in millions)                  probability taken as 100
branch)             PE       PPN      PPA          PE       PPN      PPA
275.8               166.1    164.6    35.4         20.1     20.2     94.0
25.1                35.2     482.2    42.9         94.6     6.9      77.7
17.3                33.3     338.7    31.3         100.0    9.8      106.3
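
The conversion from Table 2 to Table 3 can be sketched as follows. The per-branch variances are those of Table 2. As an assumption, K is taken as the unrounded average 1379/N (N = 5, 55, 80) rather than the rounded sizes 275.8, 25.1 and 17.3, since this agrees more closely with the tabled per-fruit figures. The dictionary layout and names are illustrative.

```python
TOTAL_FRUITS = 1379

# Per-branch variances from Table 2, keyed by the number of branches on the tree.
per_branch = {
    5: {"PE": 602_114, "PPN": 596_957, "PPA": 128_530},
    55: {"PE": 1_404_299, "PPN": 19_236_106, "PPA": 1_710_941},
    80: {"PE": 1_932_119, "PPN": 19_648_726, "PPA": 1_818_344},
}

# Variance per fruit = K * variance per branch, with K the average fruits per branch.
per_fruit = {
    n: {scheme: (TOTAL_FRUITS / n) * v for scheme, v in row.items()}
    for n, row in per_branch.items()
}

# Relative efficiency, with the smallest branch under equal probability as 100.
base = per_fruit[80]["PE"]
efficiency = {
    n: {scheme: 100.0 * base / v for scheme, v in row.items()}
    for n, row in per_fruit.items()
}

for n in per_fruit:
    millions = {s: round(v / 1e6, 1) for s, v in per_fruit[n].items()}
    percents = {s: round(e, 1) for s, e in efficiency[n].items()}
    print(n, millions, percents)
```

Because every entry is divided into the same base, the ranking of schemes does not depend on the choice of base; only the scale of the efficiency column changes.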


Under these conditions, the most efficient scheme is the small 
branch (17.3 fruits) selected with probability proportional to cross- 
sectional areas of the branches at each forking. A close second is the 
same small branch selected with equal probability. The least efficient
is the “middle”-sized branch (25.1 fruits) selected with probabilities
proportional to the numbers of branches at each forking. The expected
loss in efficiency, as larger and larger branches are taken, shows up
only when equal probability of selection is used. When other selection
schemes are used, there is no clear trend. In general, the PPN scheme
of selection is very poor, although it is probably the simplest and
quickest to carry out when small samples are taken. Of the three
probability schemes, the one using equal probability in selecting branches 
seems most difficult and time consuming to carry out in practice. Unless 
something better could be devised, it appears that each sample branch 
must be identified and given a number from 1 to N, so that branches 
can be selected purely at random. Operationally, the PPA is quite 
simple to carry out and gives no loss in efficiency over the simple random 
scheme. 



The rather high efficiency of the large branches in the PPA scheme
should be regarded as spurious or, at least, with some skepticism. It
must be remembered that it is based on only 5 observations, whereas
the others are based on 55 and 80 observations, and all are based on
one tree for one season!

With the PPA scheme of selection, it appears that stratification 
by main branch would not be very effective in increasing precision— 
particularly in view of the relatively high efficiency of the main branches 
as sample branches. Consequently, the indicated efficiency of the 
5 main branches as strata with PPA selection of the 17.3 fruit branch 
gives an increase of only 8% over unstratified. This is probably smaller 
than that which could be expected from trees in general. 

In the foregoing discussion, the assumption has been made that all
fruits are borne on the “end” branches. In the case of oranges, a 
number of fruits are borne on small branches directly connected with 
relatively large branches. With the PPA scheme of selection, this 
“forking” can be dealt with as any other forking of branches. In this 
case, a relatively small probability is given to the selection of the small 
fruiting branch. However, this branch is usually so small in diameter 
that it is difficult to measure its relative size accurately. In this case, 
it may be advisable to count the fruits on this branch and then proceed 
up the tree with the sampling procedure. To obtain unbiased estimates 
we compute the estimate in two parts. For example, suppose we have made
the following observations, where between the 2nd and 3rd forking
10 fruits were found and counted but sampling continued through the
5th forking, where the sample branch yields 20 fruits:

Forking number:                           1      2      3      4      5
Probability of branch, given the fork:    1/2    1/3    1/5    1/3    1/3
Number of fruits counted:                        10                   20

The estimate of the total is then given by:

$$\frac{10}{\frac{1}{2} \times \frac{1}{3}} + \frac{20}{\frac{1}{2} \times \frac{1}{3} \times \frac{1}{5} \times \frac{1}{3} \times \frac{1}{3}}$$

or 60 + 5400 = 5460.


In the foregoing work, the “intermediate” fruits (those which were 
counted along the sampling path as the 10 in the example) were combined 
with the sample branch fruits in the following manner: 


$$X_i = 5460$$

$$p_i = \tfrac{1}{2} \times \tfrac{1}{3} \times \tfrac{1}{5} \times \tfrac{1}{3} \times \tfrac{1}{3} = \tfrac{1}{270}$$

$$y_i = p_i X_i = \tfrac{1}{270} \times 5460 = 20.222$$

fruits as compared with the corresponding “$x_i$” value of 20. The $y_i$,
therefore, are the actual fruits to which an imputed value of the “intermediate”
fruits is added. The 10 intermediate fruits can now be
regarded as allocated to the sample branches and, in this case, the
sample branch was allocated 0.222 fruits. In the analysis, the results
of which are given in Tables 2 and 3, we actually used the $y_i$’s instead of
$x_i$’s in order to keep constant the total number of fruits dealt with.
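
The two-part estimate and the imputed count y can be reproduced with exact fractions. This is a minimal sketch: the conditional probabilities are those of the worked example (the last one read as 1/3, which is what the total of 5460 requires), and the function names are illustrative.

```python
from fractions import Fraction

# Conditional probability of the branch taken at forkings 1 to 5.
path_probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5),
              Fraction(1, 3), Fraction(1, 3)]

# Fruits counted after a given number of forkings had been passed:
# 10 intermediate fruits after forking 2, 20 on the sample branch after forking 5.
counts = {2: 10, 5: 20}


def path_probability(probs, n_forks):
    """Probability of reaching the point just past the first n_forks forkings."""
    p = Fraction(1)
    for q in probs[:n_forks]:
        p *= q
    return p


def two_part_estimate(probs, fruit_counts):
    """Sum of each count divided by the probability of reaching it."""
    return sum(Fraction(c) / path_probability(probs, k)
               for k, c in fruit_counts.items())


estimate = two_part_estimate(path_probs, counts)   # 60 + 5400
p_terminal = path_probability(path_probs, 5)       # probability of the sample branch
y = p_terminal * estimate                          # imputed fruits on the sample branch

print(estimate, p_terminal, float(y))
```

Each counted group of fruits is expanded by the reciprocal of the probability of arriving at it, which is what keeps the combined estimate unbiased.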


Summary and Conclusions 


(1) A complete count of all fruits on an orange tree was made and 
the number found on each branch was recorded. Each branch having 
a circumference value of 1’’ or more near the forking was measured. A 
total of 1379 fruits was counted. 

(2) Three methods of selecting branches as samples for estimating 
the total number of fruits were tested. Three different branch sizes 
were tested for efficiency. 

(3) A method of selecting branches, wherein each branch at a 
forking is given a probability of selection proportional to its cross- 
sectional area, was found to be quite efficient. In fact, this scheme 
gave efficiency comparable to that in which each fruiting branch is 
selected with equal probability. The equal probability scheme is not 
practicable since it requires some identification of all fruiting branches 
before sampling can be carried out. The unequal probability scheme 
described herein does not require this information for unbiased estimates. 


A FURTHER NOTE ON MISSING DATA 
Horace W. Norton 


Agricultural Experiment Station 
University of Illinois 


Nelder (1954) pointed out that an estimate of a missing datum is
not merely a convenient value for facilitating analysis of variance, but
is really an estimate of what would have been observed if the model
on which that estimate is based is true. An error in his formula for
the variance of the estimated missing value in a randomized block
design should be corrected. The correct formula is

$$\operatorname{var}(y) = \frac{(r + t - 1)\,\sigma^2}{(r - 1)(t - 1)},$$

whereas Nelder has (rt − 1) in the numerator.

It should prove helpful to some to point out that inspection suffices
to show that Nelder’s formula is incorrect. Remembering the mathematical
model, it is obvious that the general mean, the constant for
the affected block and that for the affected treatment can all be estimated
with any desired accuracy, simply by increasing the numbers of blocks
and of treatments. Hence so can their sum, which is the estimate of
the missing value. Nelder’s formula is not conformable with this
observation, having a lower limit of σ² as r and t become large. On
the other hand, his formula for the r × r Latin square is correct, and
is of the order of 3σ²/r as the square becomes large.

In referring to Query 96, which raised a question about “impossible”
estimated values, another error has occurred in Nelder’s paper. The
missing value, estimated to be −6.64, has a sampling error of 8.23 on
32 degrees of freedom. The 95% confidence interval is therefore
−6.64 ± 16.76 (rather than Nelder’s value of 8.10), thus giving no
appreciable indication whether the estimated value is based on an
erroneous model.

There is some interest in the fact that not only missing values may
have “impossible” estimated values. In the example of Query 96 the
model leads to estimates of −3.23 and −1.48 for bait A for replications
4 and 11, respectively, but these are small compared with the sampling
error of 8.23.

While tests of “possibility” of estimated values may occasionally
prove useful, it is probably always better to test for additivity, as
discussed for this example by Tukey (1954).
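
The limiting argument and the confidence interval can be illustrated numerically. The sketch below assumes the standard randomized-block result in which both formulas share the denominator (r − 1)(t − 1) and differ only in the numerator, r + t − 1 against rt − 1. The multiplier 2.037 is the tabled two-sided 5% point of Student's t on 32 degrees of freedom, entered here as a constant. Function names are illustrative.

```python
def corrected_factor(r, t):
    """Multiplier of sigma^2 with numerator r + t - 1."""
    return (r + t - 1) / ((r - 1) * (t - 1))


def nelder_factor(r, t):
    """Multiplier of sigma^2 with numerator rt - 1, the form being corrected."""
    return (r * t - 1) / ((r - 1) * (t - 1))


# As blocks and treatments increase, the first factor tends to zero
# while the second levels off near one.
for size in (5, 20, 100, 1000):
    print(size, round(corrected_factor(size, size), 4),
          round(nelder_factor(size, size), 4))

# 95% interval half-width for the missing value of Query 96.
T_32_DF = 2.037          # tabled t value, 32 degrees of freedom, 5% two-sided
sampling_error = 8.23
half_width = T_32_DF * sampling_error
print(round(half_width, 2))
```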


REFERENCES 


Nelder, J. A. (1954) “A note on missing-plot values,” Biometrics, 9: 400-401.
Tukey, J. W. (1954) “Comment on Queries Nos. 96 and 103,” Biometrics, 9: 412-413.




QUERIES 


GEORGE W. SNEDECOR, Editor


QUERY 113: In Biometrics 5, page 232 (1949), Tukey gave a test
for additivity in a 2-way table. He indicated that the theory
could be applied to other designs. We often make observations
on animals in several periods of time using randomly selected Latin
squares to allocate the animals to the periods. As an example, we
counted responses of 5 animals, each subjected to 5 conditions, during
5 periods of one week each. The numbers of responses are shown in
the table. How can we test additivity in this Latin square? (Note:
The data are given in the first lines in Table I below. Ed.)


ANSWER: Let x denote the array of original observations. As in a
simple 2-way table, the rows, columns and now treatments
are bordered with means and deviations. The k-array
contains constants due to fitting the additive model. For example,
$k_{11}$ = 391.36 + 4.64 − 105.36 − 72.96 = 217.68. Deviations x − k
are forced to add to zero in rows, columns and treatments.

Now form the y-array of Table II. The easiest one to use here is
$y_{ijk} = (k_{ijk} - k_{...})^2$. As an example, $y_{11}$ = (217.68 − 391.36)² =
301,647. It is convenient to divide each entry by 1000 then round.
Except for the rounding, this will have no effect on the results. Analysis
of variance of y gives S = 533,996, the interaction sum of squares.

Let $P = \sum y_{ijk}(x_{ijk} - k_{ijk})$ = (302)(−23.68) + ... + (2)(−12.88)
= 9,232. Then

$$\text{S.S. for non-additivity} = \frac{P^2}{S} = \frac{(9{,}232)^2}{533{,}996} = 160.$$

We now have this analysis of variance:

Source                         d.f.     Sum of squares     Mean square
Interaction, SS (Table I)      12       44,391
Non-additivity                 1        160                160
For testing                    11       44,231             4,021

Clearly, F = 160/4,021 is non-significant. There is no evidence
against the hypothesis of additivity.

[TABLE I. Original observations x, fitted constants k and deviations x − k for the 5 × 5 Latin square, bordered with means and deviations for animals, periods and treatments. The entries are illegible in this copy.]


TABLE II.
$y_{ijk} = (k_{ijk} - k_{...})^2/1000$.

                        Period
Animal      1         2         3         4         5
1           B 302     D 2       C 0       A 24      E 519
2           D 351     B 469     A 397     [illegible]
3           C 90      A 98      E 59      B 37      D 261
4           E 298     C 295     B 97      D 553     A 602
5           A 595     E 0       D 23      C 16      B 2

S = Interaction Sum of Squares = 533,996


The example just given is the application to a Latin square of a
general procedure, which can be applied to test nonadditivity in very
general situations. In general, let x be the observations, k the result
of fitting, and x − k the residuals. Form $y = c(k - c_1)^2$, where c and
$c_1$ are convenient constants (in the example c = 0.001 and $c_1$ was taken
as the grand mean of the k’s). Let h be the result of fitting to y in the
same way as k was the fit to x (in the example, h is the fit of periods,
animals and treatments to y). Then the sum of squares for non-additivity is

$$\frac{\left[\sum (y_{ij} - h_{ij})(x_{ij} - k_{ij})\right]^2}{\sum (y_{ij} - h_{ij})^2}$$

where, in the numerator, $(y_{ij} - h_{ij})$ can be replaced by $y_{ij}$ without
change in the value of the sum. (The choice in the numerator is a
matter of arithmetic convenience. In the denominator, we must get
at the sum of squares of the $y_{ij} - h_{ij}$, either directly, or by way of an
analysis of variance.)
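
The general procedure can be sketched in a few lines for a Latin square. The data below are hypothetical (a 4 × 4 cyclic square invented for illustration, not the animal data of Table I), and the function names are illustrative. The additive fit uses the fact that in a Latin square the fitted value is the row mean plus the column mean plus the treatment mean, less twice the grand mean.

```python
def additive_fit(values, rows, cols, trts):
    """Fitted values for grand mean + row + column + treatment effects.

    All arguments are flat lists with one entry per cell of the square.
    """
    grand = sum(values) / len(values)

    def level_means(labels):
        sums, counts = {}, {}
        for label, v in zip(labels, values):
            sums[label] = sums.get(label, 0.0) + v
            counts[label] = counts.get(label, 0) + 1
        return {label: sums[label] / counts[label] for label in sums}

    rm, cm, tm = level_means(rows), level_means(cols), level_means(trts)
    return [rm[r] + cm[c] + tm[t] - 2.0 * grand
            for r, c, t in zip(rows, cols, trts)]


def nonadditivity_ss(x, rows, cols, trts, c=1.0):
    """Return (S.S. for non-additivity, residual S.S. of x)."""
    k = additive_fit(x, rows, cols, trts)
    c1 = sum(k) / len(k)                        # grand mean of the k's
    y = [c * (ki - c1) ** 2 for ki in k]
    h = additive_fit(y, rows, cols, trts)       # fit y the same way as x
    res_x = [xi - ki for xi, ki in zip(x, k)]
    res_y = [yi - hi for yi, hi in zip(y, h)]
    p = sum(a * b for a, b in zip(res_y, res_x))
    s = sum(b * b for b in res_y)
    return p * p / s, sum(a * a for a in res_x)


# Hypothetical observations; treatment in cell (i, j) is (i + j) mod 4.
square = [[12, 15, 21, 30],
          [14, 22, 35, 19],
          [25, 41, 20, 24],
          [44, 18, 27, 38]]
x, rows, cols, trts = [], [], [], []
for i in range(4):
    for j in range(4):
        x.append(float(square[i][j]))
        rows.append(i)
        cols.append(j)
        trts.append((i + j) % 4)

ss_nonadd, ss_resid = nonadditivity_ss(x, rows, cols, trts)
df_testing = (4 - 1) * (4 - 2) - 1              # residual d.f. less one
f_ratio = ss_nonadd / ((ss_resid - ss_nonadd) / df_testing)
print(ss_nonadd, ss_resid, f_ratio)
```

The constant c cancels between numerator and denominator, so rescaling y for arithmetic convenience leaves the sum of squares unchanged apart from rounding.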

The original application to a balanced two-way design is another 
special case of the general procedure. There, however, the arithmetic 
is simplified if we use a seemingly quite different but numerically 
equivalent approach. 


JOHN W. TUKEY


ERRATA 


In Query 112, p. 568 of the December 1954 issue of Biometrics the 
following in Table III should be changed from 


Differences of 82>1,2 82>1,2 

established 

sign at 5% 6>1 6>1 
to 

Differences of 8>1,2 8>1,2 

established 7 > 1 7>1 

sign at 5% 6>1 6>1 


ABSTRACTS 


Papers Presented to the Société Française de Biométrie, November 24, 1954


302  A. HUET, D. SCHWARTZ, A. VESSEREAU. Study of the “Subject”
Factor and the “Vaccine” Factor in B.C.G. Vaccination.


In the course of large mass vaccination campaigns carried out by the
Centre International de l’Enfance, it was possible to investigate the
influence of the “vaccine ampoule” factor by vaccinating several
children with each ampoule and then studying the following points:
on the one hand, the distribution among ampoules of the subjects
who remained non-allergic after vaccination was examined; on the other
hand, for the allergic subjects, the size of the induration following
the tuberculin test was measured, and the possible existence of an
“ampoule” factor was investigated by analysis of variance.

An attempt was also made to characterize a batch of ampoules by the
sizes of the induration measured on the subjects. Whenever the
existence of the “ampoule” factor is detected, the measurements made
with the same ampoule are no longer independent; one is thus led to
seek a typical value for a collection of K objects, each measured
with a variable number $N_i$ of repetitions, and the collection should be
characterized by a weighted mean of the per-object means.

The authors proposed formulas giving, with allowance for the variable
number of children vaccinated per ampoule, estimates of
this weighted mean and of its variance.


303  M. OLLAGNIER. Use of 80-Column Punched Cards for the
Interpretation of the Results of Factorial Agronomic
Experiments.


The use of 80-column punched cards, described by O.
Kempthorne for trials of the 2^n type, allows the Institut de Recherches
pour les Huiles et Oléagineux (I.R.H.O.) to analyze rapidly series
of factorial trials (2^n, 3^n, 4 × 4 × 2, 3 × 3 × 2, 3 × 2 × 2) in
which a large number of factors is studied. To each plot
corresponds a card on which are punched, on the one hand, the
experimental data and, on the other hand, the positive or negative
contributions of the plot to the various effects, each effect being subdivided
into as many linear functions as it has degrees of freedom. The high-order
interactions, generally negligible, are used to estimate the error.
All the classical calculations of analysis of variance (sums
of squares, correction terms) are thereby avoided. The procedure is
financially worthwhile only if a sufficient number of trials and of factors
per trial (10 to 15) are handled.
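
The idea of a card carrying the plot's signed contribution to each effect can be illustrated for a 2^3 factorial. This is only a sketch of the general principle, not a description of the I.R.H.O. card layout: the factor names and yields are hypothetical, and the sign of an interaction is taken as the product of the signs of its component factors, the usual convention for 2^n designs.

```python
from itertools import combinations, product

FACTORS = ["A", "B", "C"]          # hypothetical factor names


def make_card(levels, yield_value):
    """One record per plot: its yield and a +1/-1 sign for every effect."""
    signs = {}
    for size in range(1, len(FACTORS) + 1):
        for combo in combinations(range(len(FACTORS)), size):
            sign = 1
            for idx in combo:
                sign *= 1 if levels[idx] == 1 else -1
            signs["".join(FACTORS[i] for i in combo)] = sign
    return {"yield": yield_value, "signs": signs}


# Hypothetical yields for the 8 treatment combinations; product() varies
# the last factor (C) fastest: 000, 001, 010, 011, 100, 101, 110, 111.
yields = [20, 24, 19, 27, 21, 30, 22, 33]
cards = [make_card(levels, y)
         for levels, y in zip(product([0, 1], repeat=3), yields)]


def effect_total(deck, effect):
    """Signed sum of yields for one effect, as a sorter and tabulator would form it."""
    return sum(card["signs"][effect] * card["yield"] for card in deck)


def effect_sum_of_squares(deck, effect):
    """Single-degree-of-freedom sum of squares: (signed total)^2 / number of plots."""
    return effect_total(deck, effect) ** 2 / len(deck)


for name in cards[0]["signs"]:
    print(name, effect_total(cards, name), effect_sum_of_squares(cards, name))
```

Once the signs are recorded with each plot, every effect is obtained by one signed summation over the deck, which is the step a tabulating machine performs.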



THE BIOMETRIC SOCIETY 


Biometric Symposium in Brazil. The next international meeting 
of the Society will be a Biometric Symposium in Campinas, near Sao 
Paulo, Brazil. It has been scheduled for July 4-8, 1955, following the 
meetings in Rio de Janeiro of the Inter-American Statistical Institute 
on June 10-22 and of the International Statistical Institute on June 
24-July 3, in which the Society has been invited to sponsor a program. 
Since The Biometric Society has the status of a Section in the Inter- 
national Union of Biological Sciences, the Symposium in Campinas 
also meets under the auspices of the Union. The travel funds that have 
been made available by the IUBS, by the National Science Foundation 
for United States citizens, and by other organizations for staff members 
are making it possible to arrange a varied and challenging program. 
The Symposium will consider the Role of Biometric Techniques in 
Biological Research, with sessions or papers on experiments with 
perennial crops, grazing and feeding experiments, biometrical genetics, 
population genetics, bioassay, sampling techniques and medical statistics. 
Local arrangements for the Symposium are being handled by Dr. C. C. 
Fraga, Instituto Agronomico, Campinas, Est. São Paulo, Brazil. The
program and general plans are under the chairmanship of the President 
of The Biometric Society, Professor W. G. Cochran, Johns Hopkins 
University, Baltimore, Maryland, U.S.A. Anyone who plans to attend 
—the Symposium is open to all—is urged to write one of the above or 
to the Secretary of the Society, Box 1106, New Haven 4, Connecticut. 

European Seminar in Biometry. Plans are progressing for a Seminar
in Biometry next September under the sponsorship of the Italian 
Region. Lasting three weeks, it will provide courses, with laboratory 
exercises, on the biometrical aspects of the design and analysis of 
biological experiments. Through the courtesy of the Italian Govern- 
ment, the Seminar will meet in the famous Monastero Villa at Varenna 
on Lake Como. Twenty or more graduates from different branches of 
biology and related fields can be accommodated, and, thanks to a grant 
from the IUBS, expenses for each participant will be held to a minimum. 
All inquiries should be addressed to Dr. L. L. Cavalli-Sforza, Via Darwin 
20, Milano, Italy, who is in charge of the project. We hope that similar 
Seminars can be continued in future years, rotating among different 
European countries. 

WHO. Dr. Manuel Aycardo served as Observer for The Biometric 
Society at the Fifth Session of the Regional Committee for the Western 
Pacific of the World Health Organization in Manila, P.I., on September 
10-16, 1954. Committee members representing 14 countries and
delegates from 22 international associations attended. One resolution
passed by the Committee related to the appointment of a Regional 
Statistician. In his statement at the Session, Dr. Aycardo emphasized
the need to plan health work statistically, noting that failure to do so
could make later evaluation of the work impossible.

Netherlands. Two biometric sessions, both at the University of 
Utrecht, were sponsored in 1954 by members of the Society in collabora- 
tion with two other Dutch biometrical clubs. On February 25, Professor 
J. Meertens and Dr. A. Drion gave papers on biometrical problems in 
genetics. In the meeting of October 27, lectures on the use of statistical 
methods in different branches of research were given by Th. J. D. 
Erlee (Uniformity trials in sugarcane), Dr. D. Dresden (Insecticides), 
Ir. Th. Ferrari (Multifactoranalysis), Ir. H. de Miranda (Organoleptic 
problems), A. A. van Soestbergen (Toxoplasmosis) and Ir. J. van Soest 
(Forestry problems). By courtesy of the Netherlands Statistical 
Society (Industries Section) members of the biometrical societies were 
invited to hear at Utrecht a paper read by Dr. Read (Manchester) on 
industrial experimentation. 

ENAR. The Region met jointly with the Statistics Section of the 
American Public Health Association on October 13 in Buffalo, New York, 
during the annual meeting of the APHA. The Uses of Sampling in 
Public Health and Related Fields were considered in papers by M. 
Rosenstock on Application of sampling in the evaluation of health 
education material, by A. Bachrach on The application of sampling 
methods for calculating hospital stay, and by D. M. Schneider on Use 
of sampling techniques in the adjustment of uniform hospital rates. 

The Biometric Society (ENAR) will meet with the American
Institute of Biological Sciences at Michigan State College, East Lansing, 
on September 5-9, 1955. Titles and abstracts for contributed papers 
for The Biometric Society should be sent to Dr. Earl L. Green, Division 
of Biology and Medicine, U.S. Atomic Energy Commission, Washington 
25, D.C., not later than May 15, 1955. 

Région Française. The most recent meeting of the Society was held on
November 24 at the Laboratoire de Zoologie of the École Normale Supérieure,
Paris. The agenda was as follows: M. Ollagnier: L’utilisation des
fiches perforées pour l’interprétation des résultats des expériences
factorielles agronomiques. Dr. A. Huet, D. Schwartz, A. Vessereau:
Etude du facteur “sujet” et du facteur “vaccin” dans la vaccination
au B.C.G.

Switzerland. At the November 27 meeting of the Swiss members of 
the Society, held in the Ophthalmological Clinic of the University of 
Geneva, the following papers were presented: La Biométrie en Suisse
by A. Linder, Expériences biométriques en Endocrinologie by R.
Borth, Biometrical problems arising out of alcoholism by E. M. Jellinek,
and L’organisation et le travail de la Division des Services d’Epidémiologie
et de Statistiques sanitaires de l’Organisation mondiale de la
Santé by Y. Biraud.

WNAR. During the Berkeley meetings of the American Association 
for the Advancement of Science, the Region co-sponsored three sessions 
on December 27-28, 1954, in collaboration with the Third Berkeley 
Symposium, the American Statistical Association, the Ecological 
Society of America and the Institute of Mathematical Statistics. The 
first, on Statistics in Biology and Genetics, featured papers on Struggle 
for existence by T. Park, J. Neyman and E. L. Scott, Some genetic 
problems in controlled populations by E. Dempster, and Some genetic 
problems in natural populations by J. F. Crow and M. Kimura. The 
Design of Experiments in Fisheries was the subject of the second 
program, with papers on Biological assumptions involved in estimating 
mortalities to downstream migrant salmon passing dams by C. O. 
Junge, Jr., Use of logbook data in the measurement of distribution and 
abundance of commercial fish stocks by M. B. Schaefer, and Some 
remarks on the design of a sampling program of a fishery for a measure 
of fishing intensity by T. M. Widrig. In the third session on Statistics 
in Medicine and Public Health, W. F. Taylor discussed Problems of 
contagion; C. L. Chiang and J. Yerushalmy, Statistical problems in 
medical diagnoses, and J. Cornfield, Some statistical problems arising 
from retrospective studies. 


NOTES 


Cooperative Graduate Summer Sessions in Statistics 


The University of Florida, North Carolina State College, Virginia
Polytechnic Institute and the Southern Regional Education Board 
are jointly sponsoring a series of cooperative summer sessions in statistics. 
The first of these cooperative graduate summer sessions was held 
during the summer of 1954 at Virginia Polytechnic Institute. At this 
session there were 89 students from 26 states and the District of 
Columbia and from India, Finland, Canada, Australia, China, Hawaii 
and the Philippines. The following courses were offered: Engineering 
Statistics, Statistical Methods I, Statistical Theory I (Probability and 
Inference), Biostatistics, Quantitative Genetics, Rank Order Statistics, 
Multivariate Analysis, and Seminar on Recent Advances in Statistics. 
Classes ranged in size from 9 to 34, with an average of 20. 

The second session will be held at the University of Florida from 
June 20 to July 29, 1955. A session is scheduled to be held at North 
Carolina State College in 1956, and another at Virginia Polytechnic 
Institute in 1957. 

The summer sessions are designed to carry out a recommendation 
of the Southern Regional Education Board’s Advisory Commission on 
Statistics, on which the three institutions initiating the program are 
represented. The sessions will be of particular interest to (1) research 
and professional workers who want intensive instruction in basic statisti- 
cal concepts and who wish to learn modern statistical methodology; 
(2) teachers of elementary statistical courses who want some formal 
training in modern statistics; (3) prospective candidates for graduate 
degrees in statistics; (4) graduate students in other fields who desire 
supporting work in statistics; and (5) professional statisticians who wish 
to keep informed of advanced specialized theory and methods. 

Each of the summer sessions will last six weeks and each course will
carry approximately three semester hours of graduate credit. The 
program may be entered at any session, and consecutive courses will 
follow in successive summers. The summer work in statistics may be 
applied as residence credit at any one of the cooperating institutions, 
as well as certain other institutions, in partial fulfillment of the require- 
ments for a master’s degree. The catalog requirements for the degree 
must be met at the degree-granting institutions. Each doctoral 
candidate should consult with the institution from which he desires to 
obtain the degree regarding the applicability of the summer courses in 
statistics. 




The faculty for the 1955 session at the University of Florida will 
include: Professor R. L. Anderson, North Carolina State College; 
Professor D. B. Duncan, University of Florida; Professor Boyd
Harshbarger, Virginia Polytechnic Institute; Professor Carl E. Marshall,
Oklahoma A. and M. College; Professor Herbert A. Meyer, University
of Florida; Professor George E. Nicholson, Jr., University of North 
Carolina; Professor Phillip J. Rulon, Harvard University; Professor 
Walter L. Smith, University of North Carolina; and Professor Dudley 
E. South, University of Florida. 

Courses to be offered this summer are: Statistical Methods I, 
Statistical Methods II (Design of Experiments), Statistical Theory I, 
Statistical Theory II (Inference and Least Squares), Advanced analysis 
I, Theory of Sampling, Theory of Statistical Inference, Mathematics 
for Statistics, Statistical Research in Education and Psychology and 
Seminar on Recent Advances in Statistics. 

The total tuition fee will be $35 for the six-weeks term. The holder 
of a doctorate degree, upon acceptance, may register without the 
payment of any tuition fee. Living and other expenses at the Uni- 
versity are reasonable. The University is in Gainesville, located in the 
rolling hills of North Central Florida, midway between the cooling 
breezes of the Gulf of Mexico and the Atlantic Ocean. 

Inquiries should be addressed to: 


PROFESSOR HERBERT A. MEYER
Statistical Laboratory 
University of Florida 
Gainesville, Florida 


Summer Sessions at Berkeley, California 


This year’s program at the Statistical Laboratory of the University 
of California, Berkeley, California, consists of two sessions: June 20- 
July 30 and August 1-September 10, 1955. The faculty of the summer 
sessions will include Professor G. E. Bates of Mt. Holyoke College, 
South Hadley, Massachusetts; Professor J. Neyman, Professor Charles 
H. Kraft and Mr. Howard G. Tucker of the Statistical Laboratory, 
University of California. 

The program includes undergraduate courses primarily meant for 
students transferring from other centers who would like to embark on 
advanced studies in Berkeley during the regular academic year. Pro- 
fessor Neyman will be available for consultations on work leading to 
higher degrees. There will be no graduate course program. However, 
graduate students may be interested in a series of lectures and seminars
to be given through July and August in connection with the second
part of the Third Berkeley Symposium on Mathematical Statistics and
Probability. The scholars who promised to participate in this event
are: T. W. Anderson, Columbia University; M. S. Bartlett, University
of Manchester; J. Berkson, Mayo Clinic; David Blackwell, Howard
University and University of California; A. J. L. Blanc-Lapierre,
Université d’Alger; J. Doob, University of Illinois; W. Feller, Princeton
University; R. Fortet, Institut Henri Poincaré; A. Girshick, Stanford
University; J. M. Hammersley, Oxford University; J. L. Hodges, Jr.,
University of California; W. Hoeffding, University of North Carolina;
Lucien LeCam, University of California; Erich L. Lehmann, University
of California; P. Lévy, l’École Polytechnique; H. Robbins, Columbia
University; Herman Rubin, Stanford University; and C. M. Stein,
Stanford University.


Summer Offerings in Statistics at Iowa State College 


The Department of Statistics at Iowa State College will offer a course 
in decision theory at the advanced graduate level during the first half of 
the 1955 summer quarter. The course will be taught by Dr. S. L. 
Isaacson. Members of the graduate faculty in statistics will be available 
during most of the summer for consultation on graduate research 
(Stat. 699) and for special problems courses (Stat. 599). 

Other offerings for the two six-week sessions (June 13-July 20 and 
July 20-August 26) of the summer quarter are designed mainly for the 
graduate minor in statistics and for the beginning graduate major 
in statistics who wish to satisfy prerequisite requirements for more 
advanced courses. These additional offerings include Stat. 401 and 402, 
“Statistical Methods for Research Workers,” offered in sequence; the
sequence, Stat. 447 and 448, “Statistical Theory for Research Workers;”
Stat. 411, “Experimental Designs for Research Workers;” and Stat. 421,
“Survey Designs for Research Workers.” Students may register for
one or both summer sessions. For additional information, write to: 
T. A. Bancroft, Director, The Statistical Laboratory, Iowa State College, 
Ames, Iowa. 


Joint Meeting of the Institute of Mathematical Statistics
and The Biometric Society (ENAR)


FRIDAY, APRIL 22


8:30 a.m.   Invited Speakers

            Chairman: Professor H. Fairfield Smith, North Carolina
            State College

            “Life Testing in the Discrete Case”*, Franklin S. McFeely
            and John E. Freund, Virginia Polytechnic Institute
            “The Components of Variance and the Correlation Between
            Relatives in Symmetrical Random Mating Populations”,
            Ted Horner, Iowa State College
            “Tests of Hypotheses When the Decision is Based on Several
            Criteria”* (Preliminary Report), Irwin Miller and John E.
            Freund, Virginia Polytechnic Institute
            “Power Function of Procedures for Some Components of
            Variance Models”, Helen Bozivich, Iowa State College
            “Preference Patterns for Decisions on Means”*, R. Lowell
            Wine and John E. Freund, Virginia Polytechnic Institute

            *Research sponsored by the Office of the Ordnance, U. S.
            Army


10:30 a.m.  Probability Theory

            Chairman: Dr. Eugene Lukacs, Office of Navy Research
            Speakers: D. Austin, Syracuse University
                      J. Blackman, Syracuse University
                      Cyrus Derman, Syracuse University


2:00 p.m.   Multivariate Analysis

            Chairman: Dr. Harold Hotelling, University of North
            Carolina
            Speakers: T. W. Anderson, Columbia University
                      W. G. Howe, Oak Ridge Institute of Nuclear
                      Studies
                      H. C. Sweeny, Virginia Polytechnic Institute


4:00 p.m.   Contributed Papers*

            Chairman: Dr. George E. Nicholson, Jr., University of
            North Carolina

            *Abstracts received prior to March 1, 1955.


SATURDAY, APRIL 23


8:30 a.m. Relation Between Smoking and Mortality From Lung 
Cancer 


Chairman: Dr. B. G. Greenberg, University of North 
Carolina 


Speakers: William Haenszel, National Cancer Institute 
Jerome Cornfield, National Institutes of Health 
Joseph Berkson, Mayo Clinic and University of 
Minnesota 


Discussants: Boyd Harshbarger, Virginia Polytechnic 
Institute 
Daniel Horn, American Cancer Society 


10:30 a.m. Contributed Papers* 


Chairman: Dr. R. L. Anderson, North Carolina State College 

1. Information and Distance Applied to Discriminant 
Analysis Between Two Normal Populations—Samuel W. 
Greenhouse, National Institute of Mental Health. 

2. Appropriate Scores in Bio-assays Using Death-Times 
and Survivor Symptoms—Johannes Ipsen, Institute of 
Laboratories and Harvard School of Public Health. 

3. A Comparison of Random and Non-Random Plot Selec- 
tion—Daniel G. Horvitz and Jack Fleischer, North 
Carolina State College. 


*Abstracts received prior to March 1, 1955. 




